From e12ca48cd9a34b1f41b11f267bddf91f05dae5be Mon Sep 17 00:00:00 2001
From: Josh Hursey
Date: Tue, 10 Aug 2010 20:51:11 +0000
Subject: [PATCH] A number of C/R enhancements per RFC below:

  http://www.open-mpi.org/community/lists/devel/2010/07/8240.php

Documentation:
  http://osl.iu.edu/research/ft/

Major Changes:
--------------
 * Added C/R-enabled Debugging support.
   Enabled with the --enable-crdebug flag. See the following website for more information:
   http://osl.iu.edu/research/ft/crdebug/
 * Added Stable Storage (SStore) framework for checkpoint storage
   * 'central' component does a direct to central storage save
   * 'stage' component stages checkpoints to central storage while the application continues execution.
     * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
     * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
 * Added Compression (compress) framework to support
 * Add two new ErrMgr recovery policies
   * {{{crmig}}} C/R Process Migration
   * {{{autor}}} C/R Automatic Recovery
 * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
 * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
   * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
   * {{{OMPI_CR_Restart}}}
   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
   * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
   * {{{OMPI_CR_Quiesce_start}}}
   * {{{OMPI_CR_Quiesce_checkpoint}}}
   * {{{OMPI_CR_Quiesce_end}}}
   * {{{OMPI_CR_self_register_checkpoint_callback}}}
   * {{{OMPI_CR_self_register_restart_callback}}}
   * {{{OMPI_CR_self_register_continue_callback}}}
 * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
 * Add a progress meter to:
   * FileM rsh (filem_rsh_process_meter)
   * SnapC full (snapc_full_progress_meter)
   * SStore stage (sstore_stage_progress_meter)
 * Added 2 new command line options to ompi-restart
   * --showme : Display the full command line that would have been exec'ed.
   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
 * Deprecated some MCA params:
   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
   * snapc_base_store_in_place deprecated, replaced with different components of SStore
   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
   * snapc_base_establish_global_snapshot_dir deprecated, never well supported
   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem

Minor Changes:
--------------
 * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
 * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
 * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
 * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
 * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
 * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
 * Cleanup the CRS framework and components to work with the SStore framework.
 * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
 * Add 'quiesce' hook to CRCP for a future enhancement.
 * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
 * Add optional application level INC callbacks (registered through the CR MPI Ext interface).
 * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
 * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
 * {{{opal-restart}}} also supports local decompression before restarting
 * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
 * {{{orte-restart}}} now uses the SStore framework to work with the metadata
 * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
 * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
 * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
 * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
 * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
 * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
 * Improve the checks for 'already checkpointing' error path.
 * Add a recovery output timer to show how long it takes to restart a job
 * Do a better job of cleaning up the old session directory on restart.
 * Add a local module to the autor and crmig ErrMgr components.
   These small modules prevent the 'orted' component from attempting a local recovery (which does not work for MPI apps at the moment)
 * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.

This commit was SVN r23587.

The following Trac tickets were found above:
  Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
  Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
  Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
  Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
  Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
  Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
  Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
---
 contrib/Makefile.am | 6 +-
 contrib/amca-param-sets/ft-enable-cr | 3 +-
 contrib/amca-param-sets/ft-enable-cr-recovery | 82 +
 ompi/mca/bml/r2/bml_r2_ft.c | 16 +-
 ompi/mca/btl/mx/btl_mx.c | 2 +-
 ompi/mca/btl/openib/btl_openib.c | 2 +-
 ompi/mca/btl/sm/btl_sm.c | 22 +-
 ompi/mca/crcp/base/Makefile.am | 3 +-
 ompi/mca/crcp/base/base.h | 12 +-
 ompi/mca/crcp/base/crcp_base_fns.c | 37 +-
 ompi/mca/crcp/base/crcp_base_select.c | 6 +-
 ompi/mca/crcp/bkmrk/crcp_bkmrk.h | 8 +-
 ompi/mca/crcp/bkmrk/crcp_bkmrk_module.c | 34 +-
 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c | 28 +-
 ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.h | 14 +-
 ompi/mca/crcp/crcp.h | 21 +
 ompi/mca/mpool/sm/mpool_sm_module.c | 5 +-
 ompi/mca/pml/bfo/pml_bfo.c | 4 +-
 ompi/mca/pml/csum/pml_csum.c | 4 +-
 ompi/mca/pml/ob1/pml_ob1.c | 4 +-
 ompi/mpiext/cr/Makefile.am | 38 +
 ompi/mpiext/cr/c/checkpoint.c | 88 +
 ompi/mpiext/cr/c/inc_register_callback.c | 39 +
 ompi/mpiext/cr/c/migrate.c | 120 +
 ompi/mpiext/cr/c/quiesce_checkpoint.c | 69 +
 ompi/mpiext/cr/c/quiesce_end.c | 74 +
 ompi/mpiext/cr/c/quiesce_start.c | 210 ++
 ompi/mpiext/cr/c/restart.c | 66 +
 ompi/mpiext/cr/c/self_register_checkpoint.c | 39 +
 ompi/mpiext/cr/c/self_register_continue.c | 39 +
 ompi/mpiext/cr/c/self_register_restart.c | 39 +
 ompi/mpiext/cr/configure.m4 | 19 +
 ompi/mpiext/cr/configure.params | 12 +
 ompi/mpiext/cr/mpiext_cr_c.h | 82 +
 ompi/runtime/ompi_cr.c | 94 +-
 ompi/runtime/ompi_cr.h | 15 +-
 ompi/tools/ompi_info/components.c | 21 +
 ompi/tools/ompi_info/ompi_info.c | 6 +
 ompi/tools/ompi_info/param.c | 13 +-
 ompi/tools/ortetools/Makefile.am | 12 +-
 opal/config/opal_configure_options.m4 | 23 +
 opal/mca/compress/Makefile.am | 42 +
 opal/mca/compress/base/Makefile.am | 21 +
 opal/mca/compress/base/base.h | 76 +
 opal/mca/compress/base/compress_base_close.c | 40 +
 opal/mca/compress/base/compress_base_fns.c | 142 ++
 opal/mca/compress/base/compress_base_open.c | 99 +
 opal/mca/compress/base/compress_base_select.c | 65 +
 .../compress/base/help-opal-compress-base.txt | 13 +
 opal/mca/compress/bzip/Makefile.am | 40 +
 opal/mca/compress/bzip/compress_bzip.h | 63 +
 .../compress/bzip/compress_bzip_component.c | 138 ++
 opal/mca/compress/bzip/compress_bzip_module.c | 247 ++
 opal/mca/compress/bzip/configure.params | 13 +
 .../compress/bzip/help-opal-compress-bzip.txt | 13 +
 opal/mca/compress/compress.h | 135 ++
 opal/mca/compress/gzip/Makefile.am | 40 +
 opal/mca/compress/gzip/compress_gzip.h | 63 +
 .../compress/gzip/compress_gzip_component.c | 138 ++
 opal/mca/compress/gzip/compress_gzip_module.c | 250 ++
 opal/mca/compress/gzip/configure.params | 13 +
 .../compress/gzip/help-opal-compress-gzip.txt | 13 +
 opal/mca/crs/base/base.h | 43 +-
 opal/mca/crs/base/crs_base_close.c | 8 +-
 opal/mca/crs/base/crs_base_fns.c | 260 +-
 opal/mca/crs/base/crs_base_open.c | 22 +-
 opal/mca/crs/base/crs_base_select.c | 8 +-
 opal/mca/crs/blcr/configure.m4 | 8 +
 opal/mca/crs/blcr/crs_blcr_module.c | 576 ++---
 opal/mca/crs/crs.h | 25 +-
 opal/mca/crs/none/crs_none_module.c | 59 +-
 opal/mca/crs/opal_crs.7in | 6 +-
 opal/mca/crs/self/crs_self_module.c | 58 +-
 opal/runtime/opal_cr.c | 345 ++-
 opal/runtime/opal_cr.h | 74 +
 opal/runtime/opal_finalize.c | 9 +-
 opal/runtime/opal_init.c | 22 +-
 opal/tools/opal-restart/help-opal-restart.txt | 9 +-
 opal/tools/opal-restart/opal-restart.c | 279 ++-
 orte/config/config_files.m4 | 4 +
 orte/mca/errmgr/autor/Makefile.am | 38 +
 orte/mca/errmgr/autor/configure.m4 | 20 +
 orte/mca/errmgr/autor/configure.params | 14 +
 orte/mca/errmgr/autor/errmgr_autor.h | 88 +
 .../mca/errmgr/autor/errmgr_autor_component.c | 161 ++
 orte/mca/errmgr/autor/errmgr_autor_module.c | 1194 ++++++++++
 .../errmgr/autor/help-orte-errmgr-autor.txt | 28 +
 orte/mca/errmgr/base/Makefile.am | 5 +-
 orte/mca/errmgr/base/base.h | 46 +
 orte/mca/errmgr/base/errmgr_base_fns.c | 487 +++-
 orte/mca/errmgr/base/errmgr_base_tool.c | 477 ++++
 orte/mca/errmgr/base/errmgr_private.h | 9 +-
 orte/mca/errmgr/crmig/Makefile.am | 38 +
 orte/mca/errmgr/crmig/configure.m4 | 20 +
 orte/mca/errmgr/crmig/configure.params | 14 +
 orte/mca/errmgr/crmig/errmgr_crmig.h | 93 +
 .../mca/errmgr/crmig/errmgr_crmig_component.c | 142 ++
 orte/mca/errmgr/crmig/errmgr_crmig_module.c | 1678 +++++++++++++
 .../errmgr/crmig/help-orte-errmgr-crmig.txt | 27 +
 orte/mca/errmgr/errmgr.h | 79 +-
 orte/mca/errmgr/example/.ompi_ignore | 0
 orte/mca/errmgr/example/Makefile.am | 38 +
 orte/mca/errmgr/example/configure.m4 | 20 +
 orte/mca/errmgr/example/configure.params | 14 +
 orte/mca/errmgr/example/errmgr_example.h | 74 +
 .../errmgr/example/errmgr_example_component.c | 120 +
 .../errmgr/example/errmgr_example_module.c | 187 ++
 .../example/help-orte-errmgr-example.txt | 14 +
 orte/mca/errmgr/hnp/errmgr_hnp.c | 13 +-
 orte/mca/errmgr/orted/errmgr_orted.c | 17 +-
 orte/mca/ess/env/ess_env_module.c | 71 +-
 orte/mca/filem/rsh/filem_rsh.h | 3 +-
 orte/mca/filem/rsh/filem_rsh_component.c | 11 +-
 orte/mca/filem/rsh/filem_rsh_module.c | 90 +-
 orte/mca/odls/base/odls_base_default_fns.c | 38 +-
 orte/mca/plm/base/plm_base_launch_support.c | 11 +-
 orte/mca/plm/plm_types.h | 2 +
 orte/mca/rml/rml_types.h | 9 +-
 orte/mca/snapc/base/base.h | 36 +-
 orte/mca/snapc/base/snapc_base_close.c | 9 +-
 orte/mca/snapc/base/snapc_base_fns.c | 962 ++------
 orte/mca/snapc/base/snapc_base_open.c | 101 +-
 orte/mca/snapc/base/snapc_base_select.c | 9 +-
 orte/mca/snapc/full/help-orte-snapc-full.txt | 8 +-
 orte/mca/snapc/full/snapc_full.h | 50 +-
 orte/mca/snapc/full/snapc_full_app.c | 1288 +++++++---
 orte/mca/snapc/full/snapc_full_component.c | 26 +-
 orte/mca/snapc/full/snapc_full_global.c | 2030 ++++++++++------
 orte/mca/snapc/full/snapc_full_local.c | 1294 ++++++----
 orte/mca/snapc/full/snapc_full_module.c | 63 +-
 orte/mca/snapc/orte_snapc.7in | 8 +-
 orte/mca/snapc/snapc.h | 131 +-
 orte/mca/sstore/Makefile.am | 46 +
 orte/mca/sstore/base/Makefile.am | 26 +
 orte/mca/sstore/base/base.h | 147 ++
 .../mca/sstore/base/help-orte-sstore-base.txt | 13 +
 orte/mca/sstore/base/sstore_base_close.c | 35 +
 orte/mca/sstore/base/sstore_base_fns.c | 956 ++++++++
 orte/mca/sstore/base/sstore_base_open.c | 232 ++
 orte/mca/sstore/base/sstore_base_select.c | 61 +
 orte/mca/sstore/central/Makefile.am | 40 +
 orte/mca/sstore/central/configure.m4 | 20 +
 orte/mca/sstore/central/configure.params | 13 +
 .../central/help-orte-sstore-central.txt | 19 +
 orte/mca/sstore/central/sstore_central.h | 127 +
 orte/mca/sstore/central/sstore_central_app.c | 742 ++++++
 .../sstore/central/sstore_central_component.c | 115 +
 .../sstore/central/sstore_central_global.c | 1243 ++++++++++
 .../mca/sstore/central/sstore_central_local.c | 995 ++++++++
 .../sstore/central/sstore_central_module.c | 359 +++
 orte/mca/sstore/orte_sstore.7in | 66 +
 orte/mca/sstore/sstore.h | 404 ++++
 orte/mca/sstore/stage/Makefile.am | 40 +
 orte/mca/sstore/stage/configure.m4 | 20 +
 orte/mca/sstore/stage/configure.params | 13 +
 .../sstore/stage/help-orte-sstore-stage.txt | 26 +
 orte/mca/sstore/stage/sstore_stage.h | 145 ++
 orte/mca/sstore/stage/sstore_stage_app.c | 723 ++++++
 .../mca/sstore/stage/sstore_stage_component.c | 235 ++
 orte/mca/sstore/stage/sstore_stage_global.c | 1763 ++++++++++++++
 orte/mca/sstore/stage/sstore_stage_local.c | 2099 +++++++++++++++++
 orte/mca/sstore/stage/sstore_stage_module.c | 373 +++
 .../data_type_support/orte_dt_copy_fns.c | 8 +-
 .../data_type_support/orte_dt_packing_fns.c | 25 +-
 .../data_type_support/orte_dt_print_fns.c | 11 +-
 .../data_type_support/orte_dt_unpacking_fns.c | 22 +-
 orte/runtime/orte_cr.c | 32 +-
 orte/runtime/orte_cr.h | 9 +-
 orte/runtime/orte_globals.c | 10 +
 orte/runtime/orte_globals.h | 4 +
 orte/tools/Makefile.am | 8 +-
 .../orte-checkpoint/help-orte-checkpoint.txt | 9 +
 orte/tools/orte-checkpoint/orte-checkpoint.c | 201 +-
 orte/tools/orte-migrate/CMakeLists.txt | 37 +
 orte/tools/orte-migrate/Makefile.am | 42 +
 orte/tools/orte-migrate/help-orte-migrate.txt | 51 +
 orte/tools/orte-migrate/orte-migrate.1in | 81 +
 orte/tools/orte-migrate/orte-migrate.c | 768 ++++++
 orte/tools/orte-restart/help-orte-restart.txt | 16 +-
 orte/tools/orte-restart/orte-restart.c | 414 ++--
 orte/tools/orterun/orterun.c | 24 +
 orte/tools/orterun/orterun.h | 5 +-
 182 files changed, 25548 insertions(+), 3640 deletions(-)
 create mode 100644 contrib/amca-param-sets/ft-enable-cr-recovery
 create mode 100644 ompi/mpiext/cr/Makefile.am
 create mode 100644 ompi/mpiext/cr/c/checkpoint.c
 create mode 100644 ompi/mpiext/cr/c/inc_register_callback.c
 create mode 100644 ompi/mpiext/cr/c/migrate.c
 create mode 100644 ompi/mpiext/cr/c/quiesce_checkpoint.c
 create mode 100644 ompi/mpiext/cr/c/quiesce_end.c
 create mode 100644 ompi/mpiext/cr/c/quiesce_start.c
 create mode 100644 ompi/mpiext/cr/c/restart.c
 create mode 100644 ompi/mpiext/cr/c/self_register_checkpoint.c
 create mode 100644 ompi/mpiext/cr/c/self_register_continue.c
 create mode 100644 ompi/mpiext/cr/c/self_register_restart.c
 create mode 100644 ompi/mpiext/cr/configure.m4
 create mode 100644 ompi/mpiext/cr/configure.params
 create mode 100644 ompi/mpiext/cr/mpiext_cr_c.h
 create mode 100644 opal/mca/compress/Makefile.am
 create mode 100644 opal/mca/compress/base/Makefile.am
 create mode 100644 opal/mca/compress/base/base.h
 create mode 100644 opal/mca/compress/base/compress_base_close.c
 create mode 100644 opal/mca/compress/base/compress_base_fns.c
 create mode 100644 opal/mca/compress/base/compress_base_open.c
 create mode 100644 opal/mca/compress/base/compress_base_select.c
 create mode 100644 opal/mca/compress/base/help-opal-compress-base.txt
 create mode 100644 opal/mca/compress/bzip/Makefile.am
 create mode 100644 opal/mca/compress/bzip/compress_bzip.h
 create mode 100644 opal/mca/compress/bzip/compress_bzip_component.c
 create mode 100644 opal/mca/compress/bzip/compress_bzip_module.c
 create mode 100644 opal/mca/compress/bzip/configure.params
 create mode 100644 opal/mca/compress/bzip/help-opal-compress-bzip.txt
 create mode 100644 opal/mca/compress/compress.h
 create mode 100644 opal/mca/compress/gzip/Makefile.am
 create mode 100644 opal/mca/compress/gzip/compress_gzip.h
 create mode 100644 opal/mca/compress/gzip/compress_gzip_component.c
 create mode 100644 opal/mca/compress/gzip/compress_gzip_module.c
 create mode 100644 opal/mca/compress/gzip/configure.params
 create mode 100644 opal/mca/compress/gzip/help-opal-compress-gzip.txt
 create mode 100644 orte/mca/errmgr/autor/Makefile.am
 create mode 100644 orte/mca/errmgr/autor/configure.m4
 create mode 100644 orte/mca/errmgr/autor/configure.params
 create mode 100644 orte/mca/errmgr/autor/errmgr_autor.h
 create mode 100644 orte/mca/errmgr/autor/errmgr_autor_component.c
 create mode 100644 orte/mca/errmgr/autor/errmgr_autor_module.c
 create mode 100644 orte/mca/errmgr/autor/help-orte-errmgr-autor.txt
 create mode 100644 orte/mca/errmgr/base/errmgr_base_tool.c
 create mode 100644 orte/mca/errmgr/crmig/Makefile.am
 create mode 100644 orte/mca/errmgr/crmig/configure.m4
 create mode 100644 orte/mca/errmgr/crmig/configure.params
 create mode 100644 orte/mca/errmgr/crmig/errmgr_crmig.h
 create mode 100644 orte/mca/errmgr/crmig/errmgr_crmig_component.c
 create mode 100644 orte/mca/errmgr/crmig/errmgr_crmig_module.c
 create mode 100644 orte/mca/errmgr/crmig/help-orte-errmgr-crmig.txt
 create mode 100644 orte/mca/errmgr/example/.ompi_ignore
 create mode 100644 orte/mca/errmgr/example/Makefile.am
 create mode 100644 orte/mca/errmgr/example/configure.m4
 create mode 100644 orte/mca/errmgr/example/configure.params
 create mode 100644 orte/mca/errmgr/example/errmgr_example.h
 create mode 100644 orte/mca/errmgr/example/errmgr_example_component.c
 create mode 100644 orte/mca/errmgr/example/errmgr_example_module.c
 create mode 100644 orte/mca/errmgr/example/help-orte-errmgr-example.txt
 create mode 100644 orte/mca/sstore/Makefile.am
 create mode 100644 orte/mca/sstore/base/Makefile.am
 create mode 100644 orte/mca/sstore/base/base.h
 create mode 100644 orte/mca/sstore/base/help-orte-sstore-base.txt
 create mode 100644 orte/mca/sstore/base/sstore_base_close.c
 create mode 100644 orte/mca/sstore/base/sstore_base_fns.c
 create mode 100644 orte/mca/sstore/base/sstore_base_open.c
 create mode 100644 orte/mca/sstore/base/sstore_base_select.c
 create mode 100644 orte/mca/sstore/central/Makefile.am
 create mode 100644 orte/mca/sstore/central/configure.m4
 create mode 100644 orte/mca/sstore/central/configure.params
 create mode 100644 orte/mca/sstore/central/help-orte-sstore-central.txt
 create mode 100644 orte/mca/sstore/central/sstore_central.h
 create mode 100644 orte/mca/sstore/central/sstore_central_app.c
 create mode 100644 orte/mca/sstore/central/sstore_central_component.c
 create mode 100644 orte/mca/sstore/central/sstore_central_global.c
 create mode 100644 orte/mca/sstore/central/sstore_central_local.c
 create mode 100644 orte/mca/sstore/central/sstore_central_module.c
 create mode 100644 orte/mca/sstore/orte_sstore.7in
 create mode 100644 orte/mca/sstore/sstore.h
 create mode 100644 orte/mca/sstore/stage/Makefile.am
 create mode 100644 orte/mca/sstore/stage/configure.m4
 create mode 100644 orte/mca/sstore/stage/configure.params
 create mode 100644 orte/mca/sstore/stage/help-orte-sstore-stage.txt
 create mode 100644 orte/mca/sstore/stage/sstore_stage.h
 create mode 100644 orte/mca/sstore/stage/sstore_stage_app.c
 create mode 100644 orte/mca/sstore/stage/sstore_stage_component.c
 create mode 100644 orte/mca/sstore/stage/sstore_stage_global.c
 create mode 100644 orte/mca/sstore/stage/sstore_stage_local.c
 create mode 100644 orte/mca/sstore/stage/sstore_stage_module.c
 create mode 100644 orte/tools/orte-migrate/CMakeLists.txt
 create mode 100644 orte/tools/orte-migrate/Makefile.am
 create mode 100644 orte/tools/orte-migrate/help-orte-migrate.txt
 create mode 100644 orte/tools/orte-migrate/orte-migrate.1in
 create mode 100644 orte/tools/orte-migrate/orte-migrate.c

diff --git a/contrib/Makefile.am b/contrib/Makefile.am
index 703a8bd2e4..5b3a84c4ca 100644
--- a/contrib/Makefile.am
+++ b/contrib/Makefile.am
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
+# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 #                         University Research and Technology
 #                         Corporation. All rights reserved.
 # Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -22,7 +22,9 @@ amca_paramdir = $(AMCA_PARAM_SETS_DIR)
 dist_amca_param_DATA = amca-param-sets/example.conf
 
 if WANT_FT
-dist_amca_param_DATA += amca-param-sets/ft-enable-cr
+dist_amca_param_DATA += \
+        amca-param-sets/ft-enable-cr \
+        amca-param-sets/ft-enable-cr-recovery
 endif
 
 EXTRA_DIST = \
diff --git a/contrib/amca-param-sets/ft-enable-cr b/contrib/amca-param-sets/ft-enable-cr
index 78b9273561..ea125f66a4 100644
--- a/contrib/amca-param-sets/ft-enable-cr
+++ b/contrib/amca-param-sets/ft-enable-cr
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2008-2009 The Trustees of Indiana University and Indiana
+# Copyright (c) 2008-2010 The Trustees of Indiana University and Indiana
 #                         University Research and Technology
 #                         Corporation. All rights reserved.
 #
@@ -37,7 +37,6 @@ opal_cr_use_thread=1
 #
 rml_wrapper=ftrm
 snapc=full
-#filem=rsh
 
 #
 # OMPI Parameters
diff --git a/contrib/amca-param-sets/ft-enable-cr-recovery b/contrib/amca-param-sets/ft-enable-cr-recovery
new file mode 100644
index 0000000000..6a1c8a7309
--- /dev/null
+++ b/contrib/amca-param-sets/ft-enable-cr-recovery
@@ -0,0 +1,82 @@
+#
+# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
+#                         University Research and Technology
+#                         Corporation. All rights reserved.
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+# An Aggregate MCA Parameter Set to enable checkpoint/restart capabilities
+# for a job.
+#
+# Usage:
+#   shell$ mpirun -am ft-enable-cr ./app
+#
+
+#
+# OPAL Parameters
+# - Turn off OPAL only checkpointing
+# - Select only checkpoint ready components
+# - Enable Additional FT infrastructure
+# - Auto-select OPAL CRS component
+# - If available, use the FT Thread (Default)
+#
+opal_cr_allow_opal_only=0
+mca_base_component_distill_checkpoint_ready=1
+ft_cr_enabled=1
+crs=
+opal_cr_use_thread=1
+
+#
+# ORTE Parameters
+# - Wrap the RML
+# - Use the 'full' Snapshot Coordinator
+# - Use the 'cm' routed component. It is the only one that is currently able to
+# +rml_wrapper=ftrm +snapc=full +routed=cm + +# +# OMPI Parameters +# - Wrap the PML +# - Use a Bookmark Exchange Fully Coordinated Checkpoint/Restart Coordination Protocol +# +pml_wrapper=crcpw +crcp=bkmrk + +# +# Temporary fix to force the event engine to use poll to behave well with BLCR +# +opal_event_include=poll + +# +# We currently only support the following options to the OpenIB BTL +# Future development will attempt to eliminate many of these restrictions +# +btl_openib_want_fork_support=1 +btl_openib_use_async_event_thread=0 +btl_openib_use_eager_rdma=0 +btl_openib_cpc_include=oob + +# Enable SIGTSTP/SIGCONT capability +# killall -TSTP mpirun +# killall -CONT mpirun +orte_forward_job_control=1 + +# +# Use the C/R Error Management and Recovery Service +# +orte_enable_recovery=1 +orte_max_global_restarts=10 +errmgr_crmig_enable=1 +errmgr_autor_enable=1 + +# +# Additional constraints to be lifted in the future +# +plm=rsh +rmaps=resilient diff --git a/ompi/mca/bml/r2/bml_r2_ft.c b/ompi/mca/bml/r2/bml_r2_ft.c index 728f88b779..dfe78d8db3 100644 --- a/ompi/mca/bml/r2/bml_r2_ft.c +++ b/ompi/mca/bml/r2/bml_r2_ft.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. 
  * Copyright (c) 2004-2007 The University of Tennessee and The University
@@ -54,7 +54,7 @@ int mca_bml_r2_ft_event(int state)
         first_continue_pass = !first_continue_pass;
 
         /* Since nothing in Checkpoint, we are fine here (unless required by BTL) */
-        if( ompi_cr_continue_like_restart && !first_continue_pass) {
+        if( orte_cr_continue_like_restart && !first_continue_pass) {
             procs = ompi_proc_all(&num_procs);
             if(NULL == procs) {
                 return OMPI_ERR_OUT_OF_RESOURCE;
@@ -136,7 +136,7 @@ int mca_bml_r2_ft_event(int state)
     }
     else if(OPAL_CRS_CONTINUE == state) {
         /* Matches OPAL_CRS_RESTART_PRE */
-        if( ompi_cr_continue_like_restart && first_continue_pass) {
+        if( orte_cr_continue_like_restart && first_continue_pass) {
             if( OMPI_SUCCESS != (ret = mca_bml_r2_finalize()) ) {
                 opal_output(0, "bml:r2: ft_event(Restart): Failed to finalize BML framework\n");
                 return ret;
@@ -147,7 +147,7 @@ int mca_bml_r2_ft_event(int state)
             }
         }
         /* Matches OPAL_CRS_RESTART */
-        else if( ompi_cr_continue_like_restart && !first_continue_pass ) {
+        else if( orte_cr_continue_like_restart && !first_continue_pass ) {
             /*
              * Barrier to make all processes have been successfully restarted before
              * we try to remove some restart only files.
@@ -157,10 +157,6 @@ int mca_bml_r2_ft_event(int state)
                 return ret;
             }
 
-            opal_output_verbose(10, ompi_cr_output,
-                                "bml:r2: ft_event(Restart): Cleanup restart files\n");
-            opal_crs_base_cleanup_flush();
-
             /*
              * Re-open the BTL framework to get the full list of components.
              */
@@ -234,10 +230,6 @@ int mca_bml_r2_ft_event(int state)
             return ret;
         }
 
-        opal_output_verbose(10, ompi_cr_output,
-                            "bml:r2: ft_event(Restart): Cleanup restart files\n");
-        opal_crs_base_cleanup_flush();
-
         /*
          * Re-open the BTL framework to get the full list of components.
          * - but first clear the MCA value that was there
diff --git a/ompi/mca/btl/mx/btl_mx.c b/ompi/mca/btl/mx/btl_mx.c
index 8f207b4df5..59a00b4706 100644
--- a/ompi/mca/btl/mx/btl_mx.c
+++ b/ompi/mca/btl/mx/btl_mx.c
@@ -641,7 +641,7 @@ int mca_btl_mx_ft_event(int state) {
          * kernel: blcr: thaw_threads returned error, aborting. -1
          * JJH: It may be possible to, instead of restarting the entire driver, just reconnect endpoints
          */
-        ompi_cr_continue_like_restart = true;
+        orte_cr_continue_like_restart = true;
 
         for( i = 0; i < mca_btl_mx_component.mx_num_btls; i++ ) {
             mx_btl = mca_btl_mx_component.mx_btls[i];
diff --git a/ompi/mca/btl/openib/btl_openib.c b/ompi/mca/btl/openib/btl_openib.c
index 20334277e8..a2712c045f 100644
--- a/ompi/mca/btl/openib/btl_openib.c
+++ b/ompi/mca/btl/openib/btl_openib.c
@@ -1735,7 +1735,7 @@ int mca_btl_openib_ft_event(int state) {
     if(OPAL_CRS_CHECKPOINT == state) {
         /* Continue must reconstruct the routes (including modex), since we
          * have to tear down the devices completely. */
-        ompi_cr_continue_like_restart = true;
+        orte_cr_continue_like_restart = true;
 
         /*
          * To keep the node from crashing we need to call ibv_close_device
diff --git a/ompi/mca/btl/sm/btl_sm.c b/ompi/mca/btl/sm/btl_sm.c
index 7d045f26d7..7405ff9454 100644
--- a/ompi/mca/btl/sm/btl_sm.c
+++ b/ompi/mca/btl/sm/btl_sm.c
@@ -52,6 +52,7 @@
 #if OPAL_ENABLE_FT_CR == 1
 #include "opal/mca/crs/base/base.h"
 #include "opal/util/basename.h"
+#include "orte/mca/sstore/sstore.h"
 #include "ompi/runtime/ompi_cr.h"
 #endif
 
@@ -1099,8 +1100,6 @@ int mca_btl_sm_ft_event(int state) {
 }
 #else
 int mca_btl_sm_ft_event(int state) {
-    char * tmp_dir = NULL;
-
     /* Notify mpool */
     if( NULL != mca_btl_sm_component.sm_mpool &&
         NULL != mca_btl_sm_component.sm_mpool->mpool_ft_event) {
@@ -1114,17 +1113,14 @@ int mca_btl_sm_ft_event(int state) {
              * for these old file handles. The restart procedure will make sure
              * these files get cleaned up appropriately.
              */
-            opal_crs_base_metadata_write_token(NULL, CRS_METADATA_TOUCH, mca_btl_sm_component.sm_seg->module_seg_path);
-
-            /* Record the job session directory */
-            opal_crs_base_metadata_write_token(NULL, CRS_METADATA_MKDIR, orte_process_info.job_session_dir);
+            orte_sstore.set_attr(orte_sstore_handle_current,
+                                 SSTORE_METADATA_LOCAL_TOUCH,
+                                 mca_btl_sm_component.sm_seg->module_seg_path);
         }
     }
     else if(OPAL_CRS_CONTINUE == state) {
-        if( ompi_cr_continue_like_restart ) {
+        if( orte_cr_continue_like_restart ) {
             if( NULL != mca_btl_sm_component.sm_seg ) {
-                /* Do not Add session directory on continue */
-
                 /* Add shared memory file */
                 opal_crs_base_cleanup_append(mca_btl_sm_component.sm_seg->module_seg_path, false);
             }
@@ -1136,14 +1132,6 @@ int mca_btl_sm_ft_event(int state) {
     else if(OPAL_CRS_RESTART == state ||
             OPAL_CRS_RESTART_PRE == state) {
         if( NULL != mca_btl_sm_component.sm_seg ) {
-            /* Add session directory */
-            opal_crs_base_cleanup_append(orte_process_info.job_session_dir, true);
-            tmp_dir = opal_dirname(orte_process_info.job_session_dir);
-            if( NULL != tmp_dir ) {
-                opal_crs_base_cleanup_append(tmp_dir, true);
-                free(tmp_dir);
-                tmp_dir = NULL;
-            }
             /* Add shared memory file */
             opal_crs_base_cleanup_append(mca_btl_sm_component.sm_seg->module_seg_path, false);
         }
diff --git a/ompi/mca/crcp/base/Makefile.am b/ompi/mca/crcp/base/Makefile.am
index 076dcadab2..5dadf4552a 100644
--- a/ompi/mca/crcp/base/Makefile.am
+++ b/ompi/mca/crcp/base/Makefile.am
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
+# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 #                         University Research and Technology
 #                         Corporation. All rights reserved.
 # Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -26,3 +26,4 @@ libmca_crcp_la_SOURCES += \
         base/crcp_base_close.c \
         base/crcp_base_select.c \
         base/crcp_base_fns.c
+
diff --git a/ompi/mca/crcp/base/base.h b/ompi/mca/crcp/base/base.h
index dc6c0bd83b..19cb659c9c 100644
--- a/ompi/mca/crcp/base/base.h
+++ b/ompi/mca/crcp/base/base.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
+ * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
  *                         University Research and Technology
  *                         Corporation. All rights reserved.
  * Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -60,6 +60,12 @@ BEGIN_C_DECLS
      */
     OMPI_DECLSPEC int ompi_crcp_base_close(void);
 
+    /**
+     * Quiesce Interface (For MPI Ext.)
+     */
+    OMPI_DECLSPEC int ompi_crcp_base_quiesce_start(MPI_Info *info);
+    OMPI_DECLSPEC int ompi_crcp_base_quiesce_end(MPI_Info *info);
+
     /**
      * 'None' component functions
      * These are to be used when no component is selected.
@@ -72,6 +78,10 @@ BEGIN_C_DECLS
     int ompi_crcp_base_module_init(void);
     int ompi_crcp_base_module_finalize(void);
 
+    /* Quiesce Interface */
+    int ompi_crcp_base_none_quiesce_start(MPI_Info *info);
+    int ompi_crcp_base_none_quiesce_end(MPI_Info *info);
+
     /* PML Interface */
     ompi_crcp_base_pml_state_t* ompi_crcp_base_none_pml_enable( bool enable, ompi_crcp_base_pml_state_t* );
 
diff --git a/ompi/mca/crcp/base/crcp_base_fns.c b/ompi/mca/crcp/base/crcp_base_fns.c
index 4977021c5a..504c3ff62c 100644
--- a/ompi/mca/crcp/base/crcp_base_fns.c
+++ b/ompi/mca/crcp/base/crcp_base_fns.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  *                         All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  *                         All rights reserved.
@@ -38,6 +38,7 @@
 #include "ompi/mca/crcp/crcp.h"
 #include "ompi/mca/crcp/base/base.h"
 #include "ompi/mca/bml/base/base.h"
+#include "ompi/info/info.h"
 #include "ompi/mca/pml/pml.h"
 #include "ompi/mca/pml/base/base.h"
 #include "ompi/mca/pml/base/pml_base_request.h"
@@ -92,6 +93,19 @@ int ompi_crcp_base_module_finalize(void)
     return OMPI_SUCCESS;
 }
 
+/****************
+ * MPI Quiesce Interface
+ ****************/
+int ompi_crcp_base_none_quiesce_start(MPI_Info *info)
+{
+    return OMPI_SUCCESS;
+}
+
+int ompi_crcp_base_none_quiesce_end(MPI_Info *info)
+{
+    return OMPI_SUCCESS;
+}
+
 /****************
  * PML Wrapper
  ****************/
@@ -397,3 +411,24 @@ ompi_crcp_base_none_btl_ft_event(int state,
 /********************
  * Utility functions
  ********************/
+
+/******************
+ * MPI Interface Functions
+ ******************/
+int ompi_crcp_base_quiesce_start(MPI_Info *info)
+{
+    if( NULL != ompi_crcp.quiesce_start ) {
+        return ompi_crcp.quiesce_start(info);
+    } else {
+        return OMPI_SUCCESS;
+    }
+}
+
+int ompi_crcp_base_quiesce_end(MPI_Info *info)
+{
+    if( NULL != ompi_crcp.quiesce_end ) {
+        return ompi_crcp.quiesce_end(info);
+    } else {
+        return OMPI_SUCCESS;
+    }
+}
diff --git a/ompi/mca/crcp/base/crcp_base_select.c b/ompi/mca/crcp/base/crcp_base_select.c
index 9c9481c90d..54baa6abf6 100644
--- a/ompi/mca/crcp/base/crcp_base_select.c
+++ b/ompi/mca/crcp/base/crcp_base_select.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  *                         All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  *                         All rights reserved.
@@ -63,6 +63,10 @@ static ompi_crcp_base_module_t none_module = { /** Finalization Function */ ompi_crcp_base_module_finalize, + /** Quiesce interface */ + ompi_crcp_base_none_quiesce_start, + ompi_crcp_base_none_quiesce_end, + /** PML Wrapper */ ompi_crcp_base_none_pml_enable, diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk.h b/ompi/mca/crcp/bkmrk/crcp_bkmrk.h index ac6eb48562..734dcaac28 100644 --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk.h +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2008 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. @@ -57,6 +57,12 @@ BEGIN_C_DECLS int ompi_crcp_bkmrk_pml_init(void); int ompi_crcp_bkmrk_pml_finalize(void); + /* + * Quiesce Interface + */ + int ompi_crcp_bkmrk_quiesce_start(MPI_Info *info); + int ompi_crcp_bkmrk_quiesce_end(MPI_Info *info); + END_C_DECLS #endif /* MCA_CRCP_HOKE_EXPORT_H */ diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_module.c b/ompi/mca/crcp/bkmrk/crcp_bkmrk_module.c index 1ca3a90313..0f461d9965 100644 --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_module.c +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_module.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. 
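`none_module` is initialized positionally, and so is bkmrk's `loc_module` in the next hunk. That means the two quiesce entries must sit exactly where `quiesce_start` and `quiesce_end` were inserted into `struct ompi_crcp_base_module_1_0_0_t`, immediately after `crcp_finalize`. Any component initializer that is not updated would shift every later function pointer by two slots. The sketch below uses a hypothetical, trimmed-down struct to show how C99 designated initializers bind entries by field name instead. It only illustrates the ordering constraint and does not address whether designated initializers suited the compilers Open MPI supported at the time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down module layout: as in the patched crcp.h,
 * the quiesce hooks sit between the finalize hook and the PML hook. */
typedef int (*sketch_init_fn_t)(void);
typedef int (*sketch_finalize_fn_t)(void);
typedef int (*sketch_quiesce_fn_t)(void *info);
typedef int (*sketch_pml_enable_fn_t)(int enable);

typedef struct {
    sketch_init_fn_t       crcp_init;
    sketch_finalize_fn_t   crcp_finalize;
    sketch_quiesce_fn_t    quiesce_start;
    sketch_quiesce_fn_t    quiesce_end;
    sketch_pml_enable_fn_t pml_enable;
} sketch_module_t;

/* Distinct return values let a test tell which hook ended up in which slot. */
static int none_init(void)             { return 0; }
static int none_finalize(void)         { return 0; }
static int none_quiesce_start(void *i) { (void)i; return 10; }
static int none_quiesce_end(void *i)   { (void)i; return 11; }

/* Designated initializers bind by field name, so inserting or reordering
 * fields in sketch_module_t cannot shift these assignments. Fields that
 * are omitted (pml_enable here) are zero-initialized, i.e. NULL. */
static const sketch_module_t sketch_none_module = {
    .crcp_init     = none_init,
    .crcp_finalize = none_finalize,
    .quiesce_start = none_quiesce_start,
    .quiesce_end   = none_quiesce_end,
};
```

An omitted hook comes out NULL, which the NULL-checking base wrappers in `crcp_base_fns.c` already tolerate.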
@@ -44,6 +44,10 @@ static ompi_crcp_base_module_t loc_module = { /** Finalization Function */ ompi_crcp_bkmrk_module_finalize, + /** Quiesce interface */ + ompi_crcp_bkmrk_quiesce_start, + ompi_crcp_bkmrk_quiesce_end, + /** PML Wrapper */ NULL, /* ompi_crcp_bkmrk_pml_enable, */ @@ -131,6 +135,34 @@ int ompi_crcp_bkmrk_module_finalize(void) return OMPI_SUCCESS; } +int ompi_crcp_bkmrk_quiesce_start(MPI_Info *info) +{ + OPAL_OUTPUT_VERBOSE((10, mca_crcp_bkmrk_component.super.output_handle, + "crcp:bkmrk: quiesce_start(--)")); +#if 0 + if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_start(QUIESCE_TAG_CKPT)) ) { + ; + } + return OMPI_SUCCESS; +#else + return OMPI_ERR_NOT_IMPLEMENTED; +#endif +} + +int ompi_crcp_bkmrk_quiesce_end(MPI_Info *info) +{ + OPAL_OUTPUT_VERBOSE((10, mca_crcp_bkmrk_component.super.output_handle, + "crcp:bkmrk: quiesce_end(--)")); +#if 0 + if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_end(QUIESCE_TAG_CONTINUE) ) ) { + ; + } + return OMPI_SUCCESS; +#else + return OMPI_ERR_NOT_IMPLEMENTED; +#endif +} + /****************** * Local functions ******************/ diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c index 5453ea19f9..6852acc7a2 100644 --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2010 The University of Tennessee and The University * of Tennessee Research Foundation. 
All rights @@ -2986,6 +2986,26 @@ int ompi_crcp_bkmrk_request_complete(struct ompi_request_t *request) } /**************** FT Event *****************/ +int ompi_crcp_bkmrk_pml_quiesce_start(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag ) { + int ret, exit_status = OMPI_SUCCESS; + + if( OMPI_SUCCESS != (ret = ft_event_coordinate_peers()) ) { + exit_status = ret; + } + + return exit_status; +} + +int ompi_crcp_bkmrk_pml_quiesce_end(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag ) { + int ret, exit_status = OMPI_SUCCESS; + + if( OMPI_SUCCESS != (ret = ft_event_finalize_exchange() ) ) { + exit_status = ret; + } + + return exit_status; +} + ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event( int state, ompi_crcp_base_pml_state_t* pml_state) @@ -3027,7 +3047,7 @@ ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event( * When we return from this function we know that all of our * channels have been flushed. */ - if( OMPI_SUCCESS != (ret = ft_event_coordinate_peers()) ) { + if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_start(QUIESCE_TAG_CKPT)) ) { opal_output(mca_crcp_bkmrk_component.super.output_handle, "crcp:bkmrk: %s ft_event: Checkpoint Coordination Failed %d", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), @@ -3060,7 +3080,7 @@ ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event( first_continue_pass = !first_continue_pass; /* Only finalize the Protocol after the PML has been rebuilt */ - if( ompi_cr_continue_like_restart && first_continue_pass ) { + if( orte_cr_continue_like_restart && first_continue_pass ) { goto DONE; } @@ -3069,7 +3089,7 @@ ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event( /* * Finish the coord protocol */ - if( OMPI_SUCCESS != (ret = ft_event_finalize_exchange() ) ) { + if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_end(QUIESCE_TAG_CONTINUE) ) ) { opal_output(mca_crcp_bkmrk_component.super.output_handle, "crcp:bkmrk: pml_ft_event: Checkpoint Finalization Failed %d", ret); diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.h 
b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.h index 8e4a5fd8af..aef6b50978 100644 --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.h +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2007 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. @@ -116,6 +116,18 @@ BEGIN_C_DECLS ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event (int state, ompi_crcp_base_pml_state_t* pml_state); + enum ompi_crcp_bkmrk_pml_quiesce_tag_type_t { + QUIESCE_TAG_NONE = 0, /* 0 No tag specified */ + QUIESCE_TAG_CKPT, /* 1 Prepare for checkpoint */ + QUIESCE_TAG_CONTINUE, /* 2 Continue after a checkpoint */ + QUIESCE_TAG_RESTART, /* 3 Restart from a checkpoint */ + QUIESCE_TAG_UNKNOWN /* 4 Unknown */ + }; + typedef enum ompi_crcp_bkmrk_pml_quiesce_tag_type_t ompi_crcp_bkmrk_pml_quiesce_tag_type_t; + + int ompi_crcp_bkmrk_pml_quiesce_start(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag ); + int ompi_crcp_bkmrk_pml_quiesce_end(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag ); + /* * Request function */ diff --git a/ompi/mca/crcp/crcp.h b/ompi/mca/crcp/crcp.h index ac01b8f537..cc82881887 100644 --- a/ompi/mca/crcp/crcp.h +++ b/ompi/mca/crcp/crcp.h @@ -61,6 +61,23 @@ typedef int (*ompi_crcp_base_module_init_fn_t) typedef int (*ompi_crcp_base_module_finalize_fn_t) (void); + +/************************ + * MPI Quiesce Interface + ************************/ +/** + * MPI_Quiesce_start component interface + */ +typedef int (*ompi_crcp_base_quiesce_start_fn_t) + (MPI_Info *info); + +/** + * MPI_Quiesce_end component interface + */ +typedef int (*ompi_crcp_base_quiesce_end_fn_t) + (MPI_Info *info); + + /************************ * PML Wrapper hooks * PML Wrapper is the CRCPW PML component @@ -283,6 +300,10 @@ struct ompi_crcp_base_module_1_0_0_t { /** Finalization Function */ ompi_crcp_base_module_finalize_fn_t crcp_finalize; + 
/**< MPI_Quiesce Interface Functions ******************/ + ompi_crcp_base_quiesce_start_fn_t quiesce_start; + ompi_crcp_base_quiesce_end_fn_t quiesce_end; + /**< PML Wrapper Functions ****************************/ ompi_crcp_base_pml_enable_fn_t pml_enable; diff --git a/ompi/mca/mpool/sm/mpool_sm_module.c b/ompi/mca/mpool/sm/mpool_sm_module.c index 721c402423..52916b033f 100644 --- a/ompi/mca/mpool/sm/mpool_sm_module.c +++ b/ompi/mca/mpool/sm/mpool_sm_module.c @@ -32,6 +32,7 @@ #include "orte/util/proc_info.h" #if OPAL_ENABLE_FT_CR == 1 +#include "orte/mca/sstore/sstore.h" #include "ompi/mca/mpool/base/base.h" #include "ompi/runtime/ompi_cr.h" #endif @@ -169,12 +170,12 @@ int mca_mpool_sm_ft_event(int state) { asprintf( &file_name, "%s"OPAL_PATH_SEP"shared_mem_pool.%s", orte_process_info.job_session_dir, orte_process_info.nodename ); - opal_crs_base_metadata_write_token(NULL, CRS_METADATA_TOUCH, file_name); + orte_sstore.set_attr(orte_sstore_handle_current, SSTORE_METADATA_LOCAL_TOUCH, file_name); free(file_name); file_name = NULL; } else if(OPAL_CRS_CONTINUE == state) { - if(ompi_cr_continue_like_restart) { + if(orte_cr_continue_like_restart) { /* Find the sm module */ self_module = mca_mpool_base_module_lookup("sm"); self_sm_module = (mca_mpool_sm_module_t*) self_module; diff --git a/ompi/mca/pml/bfo/pml_bfo.c b/ompi/mca/pml/bfo/pml_bfo.c index f57dab4b48..d0f15f2fa4 100644 --- a/ompi/mca/pml/bfo/pml_bfo.c +++ b/ompi/mca/pml/bfo/pml_bfo.c @@ -691,7 +691,7 @@ int mca_pml_bfo_ft_event( int state ) OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2); } - if( ompi_cr_continue_like_restart && !first_continue_pass ) { + if( orte_cr_continue_like_restart && !first_continue_pass ) { /* * Get a list of processes */ @@ -791,7 +791,7 @@ int mca_pml_bfo_ft_event( int state ) OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3); } - if( ompi_cr_continue_like_restart && !first_continue_pass ) { + if( orte_cr_continue_like_restart && !first_continue_pass ) { /* * Exchange the modex information once again. 
* BTLs will have republished their modex information. diff --git a/ompi/mca/pml/csum/pml_csum.c b/ompi/mca/pml/csum/pml_csum.c index c50d3314f4..4be57e8fa0 100644 --- a/ompi/mca/pml/csum/pml_csum.c +++ b/ompi/mca/pml/csum/pml_csum.c @@ -669,7 +669,7 @@ int mca_pml_csum_ft_event( int state ) OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2); } - if( ompi_cr_continue_like_restart && !first_continue_pass ) { + if( orte_cr_continue_like_restart && !first_continue_pass ) { /* * Get a list of processes */ @@ -769,7 +769,7 @@ int mca_pml_csum_ft_event( int state ) OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3); } - if( ompi_cr_continue_like_restart && !first_continue_pass ) { + if( orte_cr_continue_like_restart && !first_continue_pass ) { /* * Exchange the modex information once again. * BTLs will have republished their modex information. diff --git a/ompi/mca/pml/ob1/pml_ob1.c b/ompi/mca/pml/ob1/pml_ob1.c index 69ebb33921..cdd2b0c6db 100644 --- a/ompi/mca/pml/ob1/pml_ob1.c +++ b/ompi/mca/pml/ob1/pml_ob1.c @@ -638,7 +638,7 @@ int mca_pml_ob1_ft_event( int state ) OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2); } - if( ompi_cr_continue_like_restart && !first_continue_pass ) { + if( orte_cr_continue_like_restart && !first_continue_pass ) { /* * Get a list of processes */ @@ -738,7 +738,7 @@ int mca_pml_ob1_ft_event( int state ) OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3); } - if( ompi_cr_continue_like_restart && !first_continue_pass ) { + if( orte_cr_continue_like_restart && !first_continue_pass ) { /* * Exchange the modex information once again. * BTLs will have republished their modex information. diff --git a/ompi/mpiext/cr/Makefile.am b/ompi/mpiext/cr/Makefile.am new file mode 100644 index 0000000000..e96bdd2ad7 --- /dev/null +++ b/ompi/mpiext/cr/Makefile.am @@ -0,0 +1,38 @@ +# +# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +headers = \ + mpiext_cr_c.h + +sources = \ + c/checkpoint.c \ + c/restart.c \ + c/migrate.c \ + c/inc_register_callback.c \ + c/quiesce_start.c \ + c/quiesce_end.c \ + c/quiesce_checkpoint.c \ + c/self_register_checkpoint.c \ + c/self_register_restart.c \ + c/self_register_continue.c + +lib = libext_mpiext_cr.la +lib_sources = $(sources) + +extcomponentdir = $(pkglibdir) + +noinst_LTLIBRARIES = $(lib) +libext_mpiext_cr_la_SOURCES = $(lib_sources) +libext_mpiext_cr_la_LDFLAGS = -module -avoid-version + +ompidir = $(includedir)/openmpi/ompi/mpiext/cr +ompi_HEADERS = \ + $(headers) diff --git a/ompi/mpiext/cr/c/checkpoint.c b/ompi/mpiext/cr/c/checkpoint.c new file mode 100644 index 0000000000..9435e122a9 --- /dev/null +++ b/ompi/mpiext/cr/c/checkpoint.c @@ -0,0 +1,88 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "ompi/info/info.h" +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "orte/mca/snapc/snapc.h" + +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +static const char FUNC_NAME[] = "OMPI_CR_Checkpoint"; +#define HANDLE_SIZE_MAX 256 + +int OMPI_CR_Checkpoint(char **handle, int *seq, MPI_Info *info) +{ + int ret = MPI_SUCCESS; + MPI_Comm comm = MPI_COMM_WORLD; + orte_snapc_base_request_op_t *datum = NULL; + int state = 0; + int my_rank; + + /* argument checking */ + if (MPI_PARAM_CHECK) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + /* + * Setup the data structure for the operation + */ + datum = OBJ_NEW(orte_snapc_base_request_op_t); + datum->event = ORTE_SNAPC_OP_CHECKPOINT; + datum->is_active = true; + + MPI_Comm_rank(comm, &my_rank); + if( 0 == my_rank ) { + datum->leader = ORTE_PROC_MY_NAME->vpid; + } else { + datum->leader = -1; /* Unknown from non-root ranks */ + } + + /* + * All processes must make this call before it can start + */ + MPI_Barrier(comm); + + /* + * Leader sends the request + */ + OPAL_CR_ENTER_LIBRARY(); + ret = orte_snapc.request_op(datum); + if( OMPI_SUCCESS != ret ) { + /* Do not release datum here; it is released once, below */ + OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER, + FUNC_NAME); + } + OPAL_CR_EXIT_LIBRARY(); + + /* + * Leader then sends out the commit message (every rank, root included, needs a HANDLE_SIZE_MAX buffer for the MPI_Bcast) + */ + *handle = (char*)malloc(sizeof(char)*HANDLE_SIZE_MAX); + if( datum->leader == (int)ORTE_PROC_MY_NAME->vpid ) { + strncpy(*handle, datum->global_handle, HANDLE_SIZE_MAX - 1); + (*handle)[HANDLE_SIZE_MAX - 1] = '\0'; + *seq = datum->seq_num; + state = 0; + } + + MPI_Bcast(&state, 1, MPI_INT, 0, comm); + MPI_Bcast(seq, 1, MPI_INT, 0, comm); + MPI_Bcast(*handle, HANDLE_SIZE_MAX, MPI_CHAR, 0, comm); + + datum->is_active = false; + OBJ_RELEASE(datum); + + return ret; +} diff --git a/ompi/mpiext/cr/c/inc_register_callback.c b/ompi/mpiext/cr/c/inc_register_callback.c new file
mode 100644 index 0000000000..fd0972feb3 --- /dev/null +++ b/ompi/mpiext/cr/c/inc_register_callback.c @@ -0,0 +1,39 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "opal/runtime/opal_cr.h" +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "ompi/errhandler/errhandler.h" + +static const char FUNC_NAME[] = "OMPI_CR_INC_register_callback"; + +int OMPI_CR_INC_register_callback(OMPI_CR_INC_callback_event_t event, + OMPI_CR_INC_callback_function function, + OMPI_CR_INC_callback_function *prev_function) +{ + int rc; + + if ( MPI_PARAM_CHECK ) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + OPAL_CR_ENTER_LIBRARY(); + + rc = opal_cr_user_inc_register_callback(event, function, prev_function); + + OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME); +} diff --git a/ompi/mpiext/cr/c/migrate.c b/ompi/mpiext/cr/c/migrate.c new file mode 100644 index 0000000000..ff7bcae8c2 --- /dev/null +++ b/ompi/mpiext/cr/c/migrate.c @@ -0,0 +1,120 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "ompi/info/info.h" +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "orte/mca/snapc/snapc.h" + +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +static const char FUNC_NAME[] = "OMPI_CR_Migrate"; + +int OMPI_CR_Migrate(MPI_Comm comm, char *hostname, int rank, MPI_Info *info) +{ + int ret = MPI_SUCCESS; + orte_snapc_base_request_op_t *datum = NULL; + int my_rank, my_size, i; + char loc_hostname[MPI_MAX_PROCESSOR_NAME]; + int my_vpid; + int info_flag; + char info_value[6]; + int my_off_node = (int)false; + + /* argument checking */ + if (MPI_PARAM_CHECK) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + /* + * Setup the data structure for the operation + */ + datum = OBJ_NEW(orte_snapc_base_request_op_t); + datum->event = ORTE_SNAPC_OP_MIGRATE; + datum->is_active = true; + + MPI_Comm_rank(comm, &my_rank); + MPI_Comm_size(comm, &my_size); + if( 0 == my_rank ) { + datum->leader = ORTE_PROC_MY_NAME->vpid; + } else { + datum->leader = -1; /* Unknown from non-root ranks */ + } + + /* + * Gather all preferences to the root + */ + if( NULL == hostname ) { + loc_hostname[0] = '\0'; + } else { + strncpy(loc_hostname, hostname, MPI_MAX_PROCESSOR_NAME - 1); + loc_hostname[MPI_MAX_PROCESSOR_NAME - 1] = '\0'; + } + my_vpid = (int) ORTE_PROC_MY_NAME->vpid; + + if( 0 == my_rank ) { + datum->mig_num = my_size; + datum->mig_vpids = malloc(sizeof(int) * my_size); + datum->mig_host_pref = malloc(sizeof(char) * my_size * MPI_MAX_PROCESSOR_NAME); + datum->mig_vpid_pref = malloc(sizeof(int) * my_size); + datum->mig_off_node = malloc(sizeof(int) * my_size); + + for( i = 0; i < my_size; ++i ) { + (datum->mig_vpids)[i] = 0; + (datum->mig_host_pref)[i][0] = '\0'; + (datum->mig_vpid_pref)[i] = 0; + (datum->mig_off_node)[i] = (int)false; + } + } + + my_off_node = (int)false; + if( NULL != info ) { + MPI_Info_get(*info,
"CR_OFF_NODE", 5, info_value, &info_flag); + if( info_flag ) { + if( 0 == strncmp(info_value, "true", strlen("true")) ) { + my_off_node = (int)true; + } + } + } + + MPI_Gather(&my_vpid, 1, MPI_INT, + (datum->mig_vpids), 1, MPI_INT, 0, comm); + MPI_Gather(loc_hostname, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, + (datum->mig_host_pref), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, comm); + MPI_Gather(&rank, 1, MPI_INT, + (datum->mig_vpid_pref), 1, MPI_INT, 0, comm); + MPI_Gather(&my_off_node, 1, MPI_INT, + (datum->mig_off_node), 1, MPI_INT, 0, comm); + + /* + * Leader sends the request + */ + OPAL_CR_ENTER_LIBRARY(); + ret = orte_snapc.request_op(datum); + if( OMPI_SUCCESS != ret ) { + OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER, + FUNC_NAME); + } + OPAL_CR_EXIT_LIBRARY(); + + datum->is_active = false; + OBJ_RELEASE(datum); + + /* + * All processes must sync before leaving + */ + MPI_Barrier(comm); + + return ret; +} diff --git a/ompi/mpiext/cr/c/quiesce_checkpoint.c b/ompi/mpiext/cr/c/quiesce_checkpoint.c new file mode 100644 index 0000000000..4d436be9bc --- /dev/null +++ b/ompi/mpiext/cr/c/quiesce_checkpoint.c @@ -0,0 +1,69 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved.
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "ompi/info/info.h" +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "orte/mca/snapc/snapc.h" + +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +static const char FUNC_NAME[] = "OMPI_CR_Quiesce_checkpoint"; + +int OMPI_CR_Quiesce_checkpoint(MPI_Comm commP, char **handle, int *seq, MPI_Info *info) +{ + int ret = MPI_SUCCESS; + MPI_Comm comm = MPI_COMM_WORLD; /* Currently ignore provided comm */ + orte_snapc_base_request_op_t *datum = NULL; + int my_rank; + + /* argument checking */ + if (MPI_PARAM_CHECK) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + /* + * Setup the data structure for the operation + */ + datum = OBJ_NEW(orte_snapc_base_request_op_t); + datum->event = ORTE_SNAPC_OP_QUIESCE_CHECKPOINT; + datum->is_active = true; + + MPI_Comm_rank(comm, &my_rank); + if( 0 == my_rank ) { + datum->leader = ORTE_PROC_MY_NAME->vpid; + } else { + datum->leader = -1; /* Unknown from non-root ranks */ + } + + /* + * Since we are quiescent, then this is a local operation + */ + OPAL_CR_ENTER_LIBRARY(); + ret = orte_snapc.request_op(datum); + /*ret = ompi_crcp_base_quiesce_start(info);*/ + if( OMPI_SUCCESS != ret ) { + OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_OTHER, + FUNC_NAME); + } + OPAL_CR_EXIT_LIBRARY(); + + *handle = strdup(datum->global_handle); + *seq = datum->seq_num; + + datum->is_active = false; + OBJ_RELEASE(datum); + + return ret; +} diff --git a/ompi/mpiext/cr/c/quiesce_end.c b/ompi/mpiext/cr/c/quiesce_end.c new file mode 100644 index 0000000000..9fc1603df5 --- /dev/null +++ b/ompi/mpiext/cr/c/quiesce_end.c @@ -0,0 +1,74 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "ompi/info/info.h" +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "orte/mca/snapc/snapc.h" + +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +static const char FUNC_NAME[] = "OMPI_CR_Quiesce_end"; + +int OMPI_CR_Quiesce_end(MPI_Comm commP, MPI_Info *info) +{ + int ret = MPI_SUCCESS; + MPI_Comm comm = MPI_COMM_WORLD; /* Currently ignore provided comm */ + orte_snapc_base_request_op_t *datum = NULL; + int my_rank; + + /* argument checking */ + if (MPI_PARAM_CHECK) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + /* + * Setup the data structure for the operation + */ + datum = OBJ_NEW(orte_snapc_base_request_op_t); + datum->event = ORTE_SNAPC_OP_QUIESCE_END; + datum->is_active = true; + + MPI_Comm_rank(comm, &my_rank); + if( 0 == my_rank ) { + datum->leader = ORTE_PROC_MY_NAME->vpid; + } else { + datum->leader = -1; /* Unknown from non-root ranks */ + } + + /* + * Leader sends the request + */ + OPAL_CR_ENTER_LIBRARY(); + ret = orte_snapc.request_op(datum); + /*ret = ompi_crcp_base_quiesce_end(info);*/ + if( OMPI_SUCCESS != ret ) { + OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER, + FUNC_NAME); + } + OPAL_CR_EXIT_LIBRARY(); + + /* + * All processes must make this call before it can complete + */ + MPI_Barrier(comm); + + /* + * (Old) info logic + */ + /*cur_datum.epoch = -1;*/ + + return ret; +} + diff --git a/ompi/mpiext/cr/c/quiesce_start.c b/ompi/mpiext/cr/c/quiesce_start.c new file mode 100644 index 0000000000..2262dd55ff --- /dev/null +++ b/ompi/mpiext/cr/c/quiesce_start.c @@ -0,0 +1,210 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "ompi/info/info.h" +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "orte/mca/snapc/snapc.h" + +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +static const char FUNC_NAME[] = "OMPI_CR_Quiesce_start"; + +int OMPI_CR_Quiesce_start(MPI_Comm commP, MPI_Info *info) +{ + int ret = MPI_SUCCESS; + MPI_Comm comm = MPI_COMM_WORLD; /* Currently ignore provided comm */ + orte_snapc_base_request_op_t *datum = NULL; + int my_rank; + + /* argument checking */ + if (MPI_PARAM_CHECK) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + /* + * Setup the data structure for the operation + */ + datum = OBJ_NEW(orte_snapc_base_request_op_t); + datum->event = ORTE_SNAPC_OP_QUIESCE_START; + datum->is_active = true; + + MPI_Comm_rank(comm, &my_rank); + if( 0 == my_rank ) { + datum->leader = ORTE_PROC_MY_NAME->vpid; + } else { + datum->leader = -1; /* Unknown from non-root ranks */ + } + + /* + * All processes must make this call before it can start + */ + MPI_Barrier(comm); + + /* + * Leader sends the request + */ + OPAL_CR_ENTER_LIBRARY(); + ret = orte_snapc.request_op(datum); + /*ret = ompi_crcp_base_quiesce_start(info);*/ + if( OMPI_SUCCESS != ret ) { + /* Do not release datum here; it is released once, below */ + OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER, + FUNC_NAME); + } + + OPAL_CR_EXIT_LIBRARY(); + + datum->is_active = false; + OBJ_RELEASE(datum); + + /* + * (Old) info logic + */ + /*ompi_info_set((ompi_info_t*)*info, "target", cur_datum.target_dir);*/ + + return ret; +} + +/***************** + * Local Functions + ******************/ +#if 0 +/* Info keys: + * + * - crs: + * none = (Default) No CRS Service + * default = Whatever CRS service MPI chooses + * blcr = BLCR + * self = app level callbacks + * + * - cmdline: + * Command line to restart the process with.
+ * If empty, the user must manually enter it + * + * - target: + * Absolute path to the target directory. + * + * - handle: + * first = Earliest checkpoint directory available + * last = Most recent checkpoint directory available + * [global:local] = handle provided by the MPI library + * + * - restarting: + * 0 = not restarting + * 1 = restarting + * + * - checkpointing: + * 0 = No need to prepare for checkpointing + * 1 = MPI should prepare for checkpointing + * + * - inflight: + * default = message + * message = Drain inflight messages at the message level + * network = Drain inflight messages at the network level (if possible) + * + * - user_space_mem: + * 0 = Memory does not need to be managed + * 1 = Memory must be in user space (i.e., not on the network card) + * + */ +static int extract_info_into_datum(ompi_info_t *info, orte_snapc_base_quiesce_t *datum) +{ + int info_flag = false; + int max_crs_len = 32; + bool info_bool = false; + char *info_char = NULL; + + info_char = (char *) malloc(sizeof(char) * (OPAL_PATH_MAX+1)); + + /* + * Key: crs + */ + ompi_info_get(info, "crs", max_crs_len, info_char, &info_flag); + if( info_flag) { + datum->crs_name = strdup(info_char); + } + + /* + * Key: cmdline + */ + ompi_info_get(info, "cmdline", OPAL_PATH_MAX, info_char, &info_flag); + if( info_flag) { + datum->cmdline = strdup(info_char); + } + + /* + * Key: handle + */ + ompi_info_get(info, "handle", OPAL_PATH_MAX, info_char, &info_flag); + if( info_flag) { + datum->handle = strdup(info_char); + } + + /* + * Key: target + */ + ompi_info_get(info, "target", OPAL_PATH_MAX, info_char, &info_flag); + if( info_flag) { + datum->target_dir = strdup(info_char); + } + + /* + * Key: restarting + */ + ompi_info_get_bool(info, "restarting", &info_bool, &info_flag); + if( info_flag ) { + datum->restarting = info_bool; + } else { + datum->restarting = false; + } + + /* + * Key: checkpointing + */ + ompi_info_get_bool(info, "checkpointing", &info_bool, &info_flag); + if( info_flag ) {
datum->checkpointing = info_bool; + } else { + datum->checkpointing = false; + } + + /* + * Display all values + */ + OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle, + "crcp:bkmrk: %s extract_info: Info('crs' = '%s')", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + (NULL == datum->crs_name ? "Default (none)" : datum->crs_name))); + OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle, + "crcp:bkmrk: %s extract_info: Info('cmdline' = '%s')", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + (NULL == datum->cmdline ? "Default ()" : datum->cmdline))); + OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle, + "crcp:bkmrk: %s extract_info: Info('checkpointing' = '%c')", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + (datum->checkpointing ? 'T' : 'F'))); + OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle, + "crcp:bkmrk: %s extract_info: Info('restarting' = '%c')", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + (datum->restarting ? 'T' : 'F'))); + + if( NULL != info_char ) { + free(info_char); + info_char = NULL; + } + + return ORTE_SUCCESS; +} +#endif diff --git a/ompi/mpiext/cr/c/restart.c b/ompi/mpiext/cr/c/restart.c new file mode 100644 index 0000000000..1d8b69f30f --- /dev/null +++ b/ompi/mpiext/cr/c/restart.c @@ -0,0 +1,66 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "ompi/info/info.h" +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "orte/mca/snapc/snapc.h" + +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +static const char FUNC_NAME[] = "OMPI_CR_Restart"; + +int OMPI_CR_Restart(char *handle, int seq, MPI_Info *info) +{ + int ret = MPI_SUCCESS; + MPI_Comm comm = MPI_COMM_WORLD; + orte_snapc_base_request_op_t *datum = NULL; + + /* argument checking */ + if (MPI_PARAM_CHECK) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + /* + * Setup the data structure for the operation + */ + datum = OBJ_NEW(orte_snapc_base_request_op_t); + datum->event = ORTE_SNAPC_OP_RESTART; + datum->is_active = true; + + /* + * Restart is not collective, so the caller is the leader + */ + datum->leader = ORTE_PROC_MY_NAME->vpid; + datum->seq_num = seq; + datum->global_handle = strdup(handle); + + /* + * Leader sends the request + */ + OPAL_CR_ENTER_LIBRARY(); + ret = orte_snapc.request_op(datum); + if( OMPI_SUCCESS != ret ) { + OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER, + FUNC_NAME); + } + OPAL_CR_EXIT_LIBRARY(); + + datum->is_active = false; + OBJ_RELEASE(datum); + + /********** If successful, should never reach this point (JJH) ******/ + + return ret; +} diff --git a/ompi/mpiext/cr/c/self_register_checkpoint.c b/ompi/mpiext/cr/c/self_register_checkpoint.c new file mode 100644 index 0000000000..49be2415b7 --- /dev/null +++ b/ompi/mpiext/cr/c/self_register_checkpoint.c @@ -0,0 +1,39 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "opal/runtime/opal_cr.h" +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "ompi/errhandler/errhandler.h" +#include "opal/mca/crs/crs.h" +#include "opal/mca/crs/base/base.h" + +static const char FUNC_NAME[] = "OMPI_CR_self_register_checkpoint_callback"; + +int OMPI_CR_self_register_checkpoint_callback(OMPI_CR_self_checkpoint_fn function) +{ + int rc; + + if ( MPI_PARAM_CHECK ) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + OPAL_CR_ENTER_LIBRARY(); + + rc = opal_crs_base_self_register_checkpoint_callback(function); + + OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME); +} diff --git a/ompi/mpiext/cr/c/self_register_continue.c b/ompi/mpiext/cr/c/self_register_continue.c new file mode 100644 index 0000000000..7b50d83bb5 --- /dev/null +++ b/ompi/mpiext/cr/c/self_register_continue.c @@ -0,0 +1,39 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "opal/runtime/opal_cr.h" +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "ompi/errhandler/errhandler.h" +#include "opal/mca/crs/crs.h" +#include "opal/mca/crs/base/base.h" + +static const char FUNC_NAME[] = "OMPI_CR_self_register_continue_callback"; + +int OMPI_CR_self_register_continue_callback(OMPI_CR_self_continue_fn function) +{ + int rc; + + if ( MPI_PARAM_CHECK ) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + OPAL_CR_ENTER_LIBRARY(); + + rc = opal_crs_base_self_register_continue_callback(function); + + OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME); +} diff --git a/ompi/mpiext/cr/c/self_register_restart.c b/ompi/mpiext/cr/c/self_register_restart.c new file mode 100644 index 0000000000..20be3a9079 --- /dev/null +++ b/ompi/mpiext/cr/c/self_register_restart.c @@ -0,0 +1,39 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#include "ompi_config.h" +#include + +#include "ompi/mpi/c/bindings.h" +#include "opal/runtime/opal_cr.h" +#include "ompi/mpiext/cr/mpiext_cr_c.h" + +#include "ompi/runtime/params.h" +#include "ompi/communicator/communicator.h" +#include "ompi/errhandler/errhandler.h" +#include "opal/mca/crs/crs.h" +#include "opal/mca/crs/base/base.h" + +static const char FUNC_NAME[] = "OMPI_CR_self_register_restart_callback"; + +int OMPI_CR_self_register_restart_callback(OMPI_CR_self_restart_fn function) +{ + int rc; + + if ( MPI_PARAM_CHECK ) { + OMPI_ERR_INIT_FINALIZE(FUNC_NAME); + } + + OPAL_CR_ENTER_LIBRARY(); + + rc = opal_crs_base_self_register_restart_callback(function); + + OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME); +} diff --git a/ompi/mpiext/cr/configure.m4 b/ompi/mpiext/cr/configure.m4 new file mode 100644 index 0000000000..d3b00217c7 --- /dev/null +++ b/ompi/mpiext/cr/configure.m4 @@ -0,0 +1,19 @@ +# -*- shell-script -*- +# +# Copyright (c) 2004-2010 The Trustees of Indiana University. +# All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +# EXT_ompi_cr_CONFIG([action-if-found], [action-if-not-found]) +# ----------------------------------------------------------- +AC_DEFUN([EXT_mpiext_cr_CONFIG],[ + # If we don't want FT, don't compile this component + AS_IF([test "$opal_want_ft_cr" = "1"], + [$1], + [$2]) +])dnl diff --git a/ompi/mpiext/cr/configure.params b/ompi/mpiext/cr/configure.params new file mode 100644 index 0000000000..b2caad05a6 --- /dev/null +++ b/ompi/mpiext/cr/configure.params @@ -0,0 +1,12 @@ +# -*- shell-script -*- +# +# Copyright (c) 2004-2010 The Trustees of Indiana University. +# All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +PARAM_CONFIG_FILES="Makefile" diff --git a/ompi/mpiext/cr/mpiext_cr_c.h b/ompi/mpiext/cr/mpiext_cr_c.h new file mode 100644 index 0000000000..febf046e36 --- /dev/null +++ b/ompi/mpiext/cr/mpiext_cr_c.h @@ -0,0 +1,82 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + * + */ +#include "opal/runtime/opal_cr.h" + +/******************************** + * C/R Interfaces + ********************************/ +/* + * Request a checkpoint + */ +OMPI_DECLSPEC int OMPI_CR_Checkpoint(char **handle, int *seq, MPI_Info *info); + +/* + * Request a restart + */ +OMPI_DECLSPEC int OMPI_CR_Restart(char *handle, int seq, MPI_Info *info); + + +/******************************** + * Migration Interface + ********************************/ +/* + * Request a migration + */ +OMPI_DECLSPEC int OMPI_CR_Migrate(MPI_Comm comm, char *hostname, int rank, MPI_Info *info); + + +/******************************** + * INC Interfaces + ********************************/ +typedef opal_cr_user_inc_callback_event_t OMPI_CR_INC_callback_event_t; + +typedef opal_cr_user_inc_callback_state_t OMPI_CR_INC_callback_state_t; + +typedef int (*OMPI_CR_INC_callback_function)(OMPI_CR_INC_callback_event_t event, + OMPI_CR_INC_callback_state_t state); + +OMPI_DECLSPEC int OMPI_CR_INC_register_callback(OMPI_CR_INC_callback_event_t event, + OMPI_CR_INC_callback_function function, + OMPI_CR_INC_callback_function *prev_function); + + +/******************************** + * SELF CRS Application Interfaces + ********************************/ +typedef int (*OMPI_CR_self_checkpoint_fn)(char **restart_cmd); +typedef int (*OMPI_CR_self_restart_fn)(void); +typedef int (*OMPI_CR_self_continue_fn)(void); + +OMPI_DECLSPEC int OMPI_CR_self_register_checkpoint_callback(OMPI_CR_self_checkpoint_fn function); +OMPI_DECLSPEC int 
OMPI_CR_self_register_restart_callback(OMPI_CR_self_restart_fn function); +OMPI_DECLSPEC int OMPI_CR_self_register_continue_callback(OMPI_CR_self_continue_fn function); + + +/******************************** + * Quiescence Interfaces + ********************************/ +/* + * Start the Quiescent region. + * Note: 'comm' required to be MPI_COMM_WORLD + */ +OMPI_DECLSPEC int OMPI_CR_Quiesce_start(MPI_Comm comm, MPI_Info *info); + +/* + * Request a checkpoint during a quiescent region + * Note: 'comm' required to be MPI_COMM_WORLD + */ +OMPI_DECLSPEC int OMPI_CR_Quiesce_checkpoint(MPI_Comm comm, char **handle, int *seq, MPI_Info *info); + +/* + * End the Quiescent Region + * Note: 'comm' required to be MPI_COMM_WORLD + */ +OMPI_DECLSPEC int OMPI_CR_Quiesce_end(MPI_Comm comm, MPI_Info *info); diff --git a/ompi/runtime/ompi_cr.c b/ompi/runtime/ompi_cr.c index 1859ec8932..f52d954245 100644 --- a/ompi/runtime/ompi_cr.c +++ b/ompi/runtime/ompi_cr.c @@ -1,6 +1,6 @@ /* -*- Mode: C; c-basic-offset:4 ; -*- */ /* - * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. 
* Copyright (c) 2004-2007 The University of Tennessee and The University @@ -43,6 +43,7 @@ #include "opal/util/output.h" #include "opal/mca/crs/crs.h" #include "opal/mca/crs/base/base.h" +#include "opal/mca/installdirs/installdirs.h" #include "opal/runtime/opal_cr.h" #include "orte/mca/snapc/snapc.h" @@ -56,6 +57,18 @@ #include "ompi/mca/crcp/base/base.h" #include "ompi/communicator/communicator.h" #include "ompi/runtime/ompi_cr.h" +#if OPAL_ENABLE_CRDEBUG == 1 +#include "orte/runtime/orte_globals.h" +#include "ompi/debuggers/debuggers.h" +#endif + +#if OPAL_ENABLE_CRDEBUG == 1 +OMPI_DECLSPEC int MPIR_checkpointable = 0; +OMPI_DECLSPEC char * MPIR_controller_hostname = NULL; +OMPI_DECLSPEC char * MPIR_checkpoint_command = NULL; +OMPI_DECLSPEC char * MPIR_restart_command = NULL; +OMPI_DECLSPEC char * MPIR_checkpoint_listing_command = NULL; +#endif /************* * Local functions @@ -68,8 +81,6 @@ static int ompi_cr_coord_post_ckpt(void); static int ompi_cr_coord_post_restart(void); static int ompi_cr_coord_post_continue(void); -bool ompi_cr_continue_like_restart = false; - /************* * Local vars *************/ @@ -157,15 +168,59 @@ int ompi_cr_init(void) ompi_cr_output = opal_cr_output; } - /* Typically this is not needed. 
Individual BTLs will set this as needed */ - ompi_cr_continue_like_restart = false; - opal_output_verbose(10, ompi_cr_output, "ompi_cr: init: ompi_cr_init()"); /* Register the OMPI interlevel coordination callback */ opal_cr_reg_coord_callback(ompi_cr_coord, &prev_coord_callback); - + +#if OPAL_ENABLE_CRDEBUG == 1 + /* Check for C/R enabled debugging */ + if( MPIR_debug_with_checkpoint ) { + char *uri = NULL; + char *sep = NULL; + char *hostname = NULL; + + /* Mark as debuggable with C/R */ + MPIR_checkpointable = 1; + + /* Set the checkpoint and restart commands */ + /* Add the full path to the binary */ + asprintf(&MPIR_checkpoint_command, + "%s/ompi-checkpoint --crdebug --hnp-jobid %u", + opal_install_dirs.bindir, + ORTE_PROC_MY_HNP->jobid); + asprintf(&MPIR_restart_command, + "%s/ompi-restart --crdebug ", + opal_install_dirs.bindir); + asprintf(&MPIR_checkpoint_listing_command, + "%s/ompi-checkpoint -l --crdebug ", + opal_install_dirs.bindir); + + /* Set contact information for HNP */ + uri = strdup(orte_process_info.my_hnp_uri); + hostname = strchr(uri, ';') + 1; + sep = strchr(hostname, ';'); + if (sep) { + *sep = 0; + } + if (strncmp(hostname, "tcp://", 6) == 0) { + hostname += 6; + sep = strchr(hostname, ':'); + *sep = 0; + MPIR_controller_hostname = strdup(hostname); + } else { + MPIR_controller_hostname = strdup("localhost"); + } + + /* Cleanup */ + if( NULL != uri ) { + free(uri); + uri = NULL; + } + } +#endif + return OMPI_SUCCESS; } @@ -196,9 +251,6 @@ int ompi_cr_coord(int state) * take action given the state. 
*/ if(OPAL_CRS_CHECKPOINT == state) { - /* Default: use the fast way */ - ompi_cr_continue_like_restart = false; - /* Do Checkpoint Phase work */ ret = ompi_cr_coord_pre_ckpt(); if( ret == OMPI_EXISTS) { @@ -245,10 +297,30 @@ int ompi_cr_coord(int state) else if (OPAL_CRS_CONTINUE == state ) { /* Do Continue Phase work */ ompi_cr_coord_post_continue(); + +#if OPAL_ENABLE_CRDEBUG == 1 + /* + * If C/R enabled debugging, + * wait here for debugger to attach + */ + if( MPIR_debug_with_checkpoint ) { + MPIR_checkpoint_debugger_breakpoint(); + } +#endif } else if (OPAL_CRS_RESTART == state ) { /* Do Restart Phase work */ ompi_cr_coord_post_restart(); + +#if OPAL_ENABLE_CRDEBUG == 1 + /* + * If C/R enabled debugging, + * wait here for debugger to attach + */ + if( MPIR_debug_with_checkpoint ) { + MPIR_checkpoint_debugger_breakpoint(); + } +#endif } else if (OPAL_CRS_TERM == state ) { /* Do Continue Phase work in prep to terminate the application */ @@ -330,7 +402,7 @@ static int ompi_cr_coord_pre_continue(void) { opal_output_verbose(10, ompi_cr_output, "ompi_cr: coord_pre_continue: ompi_cr_coord_pre_continue()"); - if( ompi_cr_continue_like_restart ) { + if( orte_cr_continue_like_restart ) { /* Mimic ompi_cr_coord_pre_restart(); */ if( ORTE_SUCCESS != (ret = mca_pml.pml_ft_event(OPAL_CRS_CONTINUE))) { exit_status = ret; diff --git a/ompi/runtime/ompi_cr.h b/ompi/runtime/ompi_cr.h index 21b76d5b3d..56677a1164 100644 --- a/ompi/runtime/ompi_cr.h +++ b/ompi/runtime/ompi_cr.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. 
* Copyright (c) 2004-2005 The University of Tennessee and The University @@ -26,6 +26,7 @@ #define OMPI_CR_H #include "ompi_config.h" +#include "orte/runtime/orte_cr.h" BEGIN_C_DECLS @@ -49,11 +50,13 @@ BEGIN_C_DECLS */ OMPI_DECLSPEC extern int ompi_cr_output; - /* - * If one of the BTLs that shutdown require a full, clean rebuild of the - * point-to-point stack on 'continue' as well as 'restart'. - */ - OPAL_DECLSPEC extern bool ompi_cr_continue_like_restart; +#if OPAL_ENABLE_CRDEBUG == 1 + OMPI_DECLSPEC extern int MPIR_checkpointable; + OMPI_DECLSPEC extern char * MPIR_controller_hostname; + OMPI_DECLSPEC extern char * MPIR_checkpoint_command; + OMPI_DECLSPEC extern char * MPIR_restart_command; + OMPI_DECLSPEC extern char * MPIR_checkpoint_listing_command; +#endif END_C_DECLS diff --git a/ompi/tools/ompi_info/components.c b/ompi/tools/ompi_info/components.c index 0a88bc34eb..f68d22fff2 100644 --- a/ompi/tools/ompi_info/components.c +++ b/ompi/tools/ompi_info/components.c @@ -51,6 +51,8 @@ #if OPAL_ENABLE_FT_CR == 1 #include "opal/mca/crs/crs.h" #include "opal/mca/crs/base/base.h" +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" #endif #include "opal/runtime/opal.h" #include "opal/dss/dss.h" @@ -114,6 +116,8 @@ #if OPAL_ENABLE_FT_CR == 1 #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" #endif #if ORTE_ENABLE_SENSORS #include "orte/mca/sensor/sensor.h" @@ -330,6 +334,14 @@ void ompi_info_open_components(void) map->type = strdup("crs"); map->components = &opal_crs_base_components_available; opal_pointer_array_add(&component_map, map); + + if (OPAL_SUCCESS != opal_compress_base_open()) { + goto error; + } + map = OBJ_NEW(ompi_info_component_map_t); + map->type = strdup("compress"); + map->components = &opal_compress_base_components_available; + opal_pointer_array_add(&component_map, map); #endif /* OPAL's installdirs base open has 
already been called as part of @@ -460,6 +472,14 @@ void ompi_info_open_components(void) opal_pointer_array_add(&component_map, map); #if OPAL_ENABLE_FT_CR == 1 + if (ORTE_SUCCESS != orte_sstore_base_open()) { + goto error; + } + map = OBJ_NEW(ompi_info_component_map_t); + map->type = strdup("sstore"); + map->components = &orte_sstore_base_components_available; + opal_pointer_array_add(&component_map, map); + if (ORTE_SUCCESS != orte_snapc_base_open()) { goto error; } @@ -680,6 +700,7 @@ void ompi_info_close_components() #if !ORTE_DISABLE_FULL_SUPPORT #if OPAL_ENABLE_FT_CR == 1 (void) orte_snapc_base_close(); + (void) orte_sstore_base_close(); #endif (void) orte_filem_base_close(); (void) orte_iof_base_close(); diff --git a/ompi/tools/ompi_info/ompi_info.c b/ompi/tools/ompi_info/ompi_info.c index 36ab691f80..caf14ee3b9 100644 --- a/ompi/tools/ompi_info/ompi_info.c +++ b/ompi/tools/ompi_info/ompi_info.c @@ -37,6 +37,9 @@ #include "opal/class/opal_object.h" #include "opal/class/opal_pointer_array.h" #include "opal/runtime/opal.h" +#if OPAL_ENABLE_FT_CR == 1 +#include "opal/runtime/opal_cr.h" +#endif #include "opal/util/cmd_line.h" #include "opal/util/argv.h" #include "opal/mca/base/base.h" @@ -196,7 +199,9 @@ int main(int argc, char *argv[]) opal_pointer_array_add(&mca_types, "installdirs"); opal_pointer_array_add(&mca_types, "sysinfo"); #if OPAL_ENABLE_FT_CR == 1 + opal_cr_set_enabled(true); opal_pointer_array_add(&mca_types, "crs"); + opal_pointer_array_add(&mca_types, "compress"); #endif opal_pointer_array_add(&mca_types, "dpm"); opal_pointer_array_add(&mca_types, "pubsub"); @@ -228,6 +233,7 @@ int main(int argc, char *argv[]) opal_pointer_array_add(&mca_types, "routed"); opal_pointer_array_add(&mca_types, "plm"); #if OPAL_ENABLE_FT_CR == 1 + opal_pointer_array_add(&mca_types, "sstore"); opal_pointer_array_add(&mca_types, "snapc"); #endif #if ORTE_ENABLE_SENSORS diff --git a/ompi/tools/ompi_info/param.c b/ompi/tools/ompi_info/param.c index b30741f7e7..6d28f0ecb1 
100644 --- a/ompi/tools/ompi_info/param.c +++ b/ompi/tools/ompi_info/param.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. * Copyright (c) 2004-2006 The University of Tennessee and The University @@ -515,6 +515,7 @@ void ompi_info_do_config(bool want_all) char *wtime_support; char *symbol_visibility; char *ft_support; + char *crdebug_support; /* Do a little preprocessor trickery here to figure ompi_info_out the * tri-state of MPI_PARAM_CHECK (which will be either 0, 1, or * ompi_mpi_param_check). The preprocessor will only allow @@ -582,7 +583,10 @@ void ompi_info_do_config(bool want_all) asprintf(&ft_support, "%s (checkpoint thread: %s)", OPAL_ENABLE_FT ? "yes" : "no", OPAL_ENABLE_FT_THREAD ? "yes" : "no");; - + + asprintf(&crdebug_support, "%s", + OPAL_ENABLE_CRDEBUG ? "yes" : "no"); + /* output values */ ompi_info_out("Configured by", "config:user", OMPI_CONFIGURE_USER); ompi_info_out("Configured on", "config:timestamp", OMPI_CONFIGURE_DATE); @@ -833,7 +837,10 @@ void ompi_info_do_config(bool want_all) ompi_info_out("FT Checkpoint support", "options:ft_support", ft_support); free(ft_support); - + + ompi_info_out("C/R Enabled Debugging", "options:crdebug_support", crdebug_support); + free(crdebug_support); + ompi_info_out_int("MPI_MAX_PROCESSOR_NAME", "options:mpi-max-processor-name", MPI_MAX_PROCESSOR_NAME); ompi_info_out_int("MPI_MAX_ERROR_STRING", "options:mpi-max-error-string", diff --git a/ompi/tools/ortetools/Makefile.am b/ompi/tools/ortetools/Makefile.am index 06d402db5d..b48efb2739 100644 --- a/ompi/tools/ortetools/Makefile.am +++ b/ompi/tools/ortetools/Makefile.am @@ -1,5 +1,5 @@ # -# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana +# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana # University Research and 
Technology # Corporation. All rights reserved. # Copyright (c) 2004-2005 The University of Tennessee and The University @@ -39,6 +39,7 @@ install-exec-hook: if WANT_FT (cd $(DESTDIR)$(bindir); rm -f ompi-checkpoint$(EXEEXT); $(LN_S) orte-checkpoint$(EXEEXT) ompi-checkpoint$(EXEEXT)) (cd $(DESTDIR)$(bindir); rm -f ompi-restart$(EXEEXT); $(LN_S) orte-restart$(EXEEXT) ompi-restart$(EXEEXT)) + (cd $(DESTDIR)$(bindir); rm -f ompi-migrate$(EXEEXT); $(LN_S) orte-migrate$(EXEEXT) ompi-migrate$(EXEEXT)) endif uninstall-local: @@ -50,7 +51,8 @@ uninstall-local: $(DESTDIR)$(bindir)/ompi-top$(EXEEXT) if WANT_FT rm -f $(DESTDIR)$(bindir)/ompi-checkpoint$(EXEEXT) \ - $(DESTDIR)$(bindir)/ompi-restart$(EXEEXT) + $(DESTDIR)$(bindir)/ompi-restart$(EXEEXT) \ + $(DESTDIR)$(bindir)/ompi-migrate$(EXEEXT) endif endif # !ORTE_DISABLE_FULL_SUPPORT @@ -95,6 +97,12 @@ $(top_builddir)/orte/tools/orte-restart/orte-restart.1: ompi-restart.1: $(top_builddir)/orte/tools/orte-restart/orte-restart.1 cp -f $(top_builddir)/orte/tools/orte-restart/orte-restart.1 ompi-restart.1 +$(top_builddir)/orte/tools/orte-migrate/orte-migrate.1: + (cd $(top_builddir)/orte/tools/orte-migrate && $(MAKE) $(AM_MAKEFLAGS) orte-migrate.1) + +ompi-migrate.1: $(top_builddir)/orte/tools/orte-migrate/orte-migrate.1 + cp -f $(top_builddir)/orte/tools/orte-migrate/orte-migrate.1 ompi-migrate.1 + $(top_builddir)/orte/tools/orte-top/orte-top.1: (cd $(top_builddir)/orte/tools/orte-top && $(MAKE) $(AM_MAKEFLAGS) orte-top.1) diff --git a/opal/config/opal_configure_options.m4 b/opal/config/opal_configure_options.m4 index f8feaf5c16..6047874972 100644 --- a/opal/config/opal_configure_options.m4 +++ b/opal/config/opal_configure_options.m4 @@ -541,4 +541,27 @@ OPAL_WITH_OPTION_MIN_MAX_VALUE(datarep_string, 128, 64, 256) AC_ARG_WITH([libltdl], [AC_HELP_STRING([--with-libltdl(=DIR)], [Where to find libltdl (this option is ignored if --disable-dlopen is used). 
DIR can take one of three values: "internal", "external", or a valid directory name. "internal" (or no DIR value) forces Open MPI to use its internal copy of libltdl. "external" forces Open MPI to use an external installation of libltdl. Supplying a valid directory name also forces Open MPI to use an external installation of libltdl, and adds DIR/include, DIR/lib, and DIR/lib64 to the search path for headers and libraries.])]) + +# +# Checkpoint/restart enabled debugging +# +AC_MSG_CHECKING([if want checkpoint/restart enabled debugging option]) +AC_ARG_ENABLE([crdebug], + [AC_HELP_STRING([--enable-crdebug], + [enable checkpoint/restart debugging functionality (default: disabled)])]) + +if test "$ompi_want_ft" = "0"; then + ompi_want_prd=0 + AC_MSG_RESULT([Disabled (fault tolerance disabled --without-ft)]) +elif test "$enable_crdebug" = "yes"; then + ompi_want_prd=1 + AC_MSG_RESULT([Enabled]) +else + ompi_want_prd=0 + AC_MSG_RESULT([Disabled]) +fi + +AC_DEFINE_UNQUOTED([OPAL_ENABLE_CRDEBUG], [$ompi_want_prd], + [Whether we want checkpoint/restart enabled debugging functionality or not]) + ])dnl diff --git a/opal/mca/compress/Makefile.am b/opal/mca/compress/Makefile.am new file mode 100644 index 0000000000..22f2219a85 --- /dev/null +++ b/opal/mca/compress/Makefile.am @@ -0,0 +1,42 @@ +# +# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +include $(top_srcdir)/Makefile.man-page-rules + +# main library setup +noinst_LTLIBRARIES = libmca_compress.la +libmca_compress_la_SOURCES = + +# header setup +nobase_opal_HEADERS = + +# local files +headers = compress.h +libmca_compress_la_SOURCES += $(headers) + +# Ensure that the man pages are rebuilt if the opal_config.h file +# changes; a "good enough" way to know if configure was run again (and +# therefore the release date or version may have changed) +$(nodist_man_MANS): $(top_builddir)/opal/include/opal_config.h + +# Conditionally install the header files +if WANT_INSTALL_HEADERS +nobase_opal_HEADERS += $(headers) +opaldir = $(includedir)/openmpi/opal/mca/compress +else +opaldir = $(includedir) +endif + +include base/Makefile.am + +distclean-local: + rm -f base/static-components.h + rm -f $(nodist_man_MANS) diff --git a/opal/mca/compress/base/Makefile.am b/opal/mca/compress/base/Makefile.am new file mode 100644 index 0000000000..bfada1c711 --- /dev/null +++ b/opal/mca/compress/base/Makefile.am @@ -0,0 +1,21 @@ +# +# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +dist_pkgdata_DATA = base/help-opal-compress-base.txt + +headers += \ + base/base.h + +libmca_compress_la_SOURCES += \ + base/compress_base_open.c \ + base/compress_base_close.c \ + base/compress_base_select.c \ + base/compress_base_fns.c diff --git a/opal/mca/compress/base/base.h b/opal/mca/compress/base/base.h new file mode 100644 index 0000000000..77fc90cf82 --- /dev/null +++ b/opal/mca/compress/base/base.h @@ -0,0 +1,76 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. 
+ * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +#ifndef OPAL_COMPRESS_BASE_H +#define OPAL_COMPRESS_BASE_H + +#include "opal_config.h" +#include "opal/mca/compress/compress.h" +#include "opal/util/opal_environ.h" +#include "opal/runtime/opal_cr.h" + +/* + * Global functions for MCA overall COMPRESS + */ + +#if defined(c_plusplus) || defined(__cplusplus) +extern "C" { +#endif + + /** + * Initialize the COMPRESS MCA framework + * + * @retval OPAL_SUCCESS Upon success + * @retval OPAL_ERROR Upon failures + * + * This function is invoked during opal_init(); + */ + OPAL_DECLSPEC int opal_compress_base_open(void); + + /** + * Select an available component. + * + * @retval OPAL_SUCCESS Upon Success + * @retval OPAL_NOT_FOUND If no component can be selected + * @retval OPAL_ERROR Upon other failure + * + */ + OPAL_DECLSPEC int opal_compress_base_select(void); + + /** + * Finalize the COMPRESS MCA framework + * + * @retval OPAL_SUCCESS Upon success + * @retval OPAL_ERROR Upon failures + * + * This function is invoked during opal_finalize(); + */ + OPAL_DECLSPEC int opal_compress_base_close(void); + + /** + * Globals + */ + OPAL_DECLSPEC extern int opal_compress_base_output; + OPAL_DECLSPEC extern opal_list_t opal_compress_base_components_available; + OPAL_DECLSPEC extern opal_compress_base_component_t opal_compress_base_selected_component; + OPAL_DECLSPEC extern opal_compress_base_module_t opal_compress; + + /** + * + */ + OPAL_DECLSPEC int opal_compress_base_tar_create(char ** target); + OPAL_DECLSPEC int opal_compress_base_tar_extract(char ** target); + +#if defined(c_plusplus) || defined(__cplusplus) +} +#endif + +#endif /* OPAL_COMPRESS_BASE_H */ diff --git a/opal/mca/compress/base/compress_base_close.c b/opal/mca/compress/base/compress_base_close.c new file mode 100644 index 0000000000..691e333733 --- /dev/null +++ b/opal/mca/compress/base/compress_base_close.c @@ -0,0 +1,40 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana 
University. + * All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "opal_config.h" + +#include +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/include/opal/constants.h" +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" + +int opal_compress_base_close(void) +{ + /* Compression currently only used with C/R */ + if( !opal_cr_is_enabled ) { + opal_output_verbose(10, opal_compress_base_output, + "compress:close: FT is not enabled, skipping!"); + return OPAL_SUCCESS; + } + + /* Call the component's finalize routine */ + if( NULL != opal_compress.finalize ) { + opal_compress.finalize(); + } + + /* Close all available modules that are open */ + mca_base_components_close(opal_compress_base_output, + &opal_compress_base_components_available, + NULL); + + return OPAL_SUCCESS; +} diff --git a/opal/mca/compress/base/compress_base_fns.c b/opal/mca/compress/base/compress_base_fns.c new file mode 100644 index 0000000000..71a9f7715b --- /dev/null +++ b/opal/mca/compress/base/compress_base_fns.c @@ -0,0 +1,142 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved.
+ * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "opal_config.h" + +#include +#include +#if HAVE_SYS_TYPES_H +#include +#endif +#if HAVE_UNISTD_H +#include +#endif +#ifdef HAVE_FCNTL_H +#include +#endif /* HAVE_FCNTL_H */ +#ifdef HAVE_SYS_STAT_H +#include +#endif + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/include/opal/constants.h" +#include "opal/util/os_dirpath.h" +#include "opal/util/output.h" +#include "opal/util/argv.h" + +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" + +/****************** + * Local Function Defs + ******************/ + +/****************** + * Object stuff + ******************/ + +int opal_compress_base_tar_create(char ** target) +{ + int exit_status = OPAL_SUCCESS; + char *cmd = NULL; + char *tar_target = NULL; + char **argv = NULL; + pid_t child_pid = 0; + int status = 0; + + asprintf(&tar_target, "%s.tar", *target); + + child_pid = fork(); + if( 0 == child_pid ) { /* Child */ + asprintf(&cmd, "tar -cf %s %s", tar_target, *target); + + argv = opal_argv_split(cmd, ' '); + status = execvp(argv[0], argv); + + opal_output(0, "compress:base: Tar:: Failed to exec child [%s] status = %d\n", cmd, status); + exit(OPAL_ERROR); + } + else if(0 < child_pid) { + waitpid(child_pid, &status, 0); + + if( !WIFEXITED(status) || 0 != WEXITSTATUS(status) ) { + exit_status = OPAL_ERROR; + goto cleanup; + } + + free(*target); + *target = strdup(tar_target); + } + else { + exit_status = OPAL_ERROR; + goto cleanup; + } + + cleanup: + if( NULL != cmd ) { + free(cmd); + cmd = NULL; + } + if( NULL != tar_target ) { + free(tar_target); + tar_target = NULL; + } + + return exit_status; +} + +int opal_compress_base_tar_extract(char ** target) +{ + int exit_status = OPAL_SUCCESS; + char *cmd = NULL; + char **argv = NULL; + pid_t child_pid = 0; + int status = 0; + + child_pid = fork(); + if( 0 == child_pid ) { /* Child */ + asprintf(&cmd, "tar -xf %s", *target); + + argv =
opal_argv_split(cmd, ' '); + status = execvp(argv[0], argv); + + opal_output(0, "compress:base: Tar:: Failed to exec child [%s] status = %d\n", cmd, status); + exit(OPAL_ERROR); + } + else if(0 < child_pid) { + waitpid(child_pid, &status, 0); + + if( !WIFEXITED(status) || 0 != WEXITSTATUS(status) ) { + exit_status = OPAL_ERROR; + goto cleanup; + } + + /* Strip off the '.tar' */ + (*target)[strlen(*target)-4] = '\0'; + } + else { + exit_status = OPAL_ERROR; + goto cleanup; + } + + cleanup: + if( NULL != cmd ) { + free(cmd); + cmd = NULL; + } + + return exit_status; +} + +/****************** + * Local Functions + ******************/ diff --git a/opal/mca/compress/base/compress_base_open.c b/opal/mca/compress/base/compress_base_open.c new file mode 100644 index 0000000000..f0efa6225d --- /dev/null +++ b/opal/mca/compress/base/compress_base_open.c @@ -0,0 +1,99 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved. + * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "opal_config.h" + +#include +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/include/opal/constants.h" +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" +#include "opal/util/output.h" + +#include "opal/mca/compress/base/static-components.h" + +/* + * Globals + */ +int opal_compress_base_output = -1; +opal_compress_base_module_t opal_compress = { + NULL, /* init */ + NULL, /* finalize */ + NULL, /* compress */ + NULL, /* compress_nb */ + NULL, /* decompress */ + NULL /* decompress_nb */ +}; +opal_list_t opal_compress_base_components_available; +opal_compress_base_component_t opal_compress_base_selected_component; + +/** + * Function for finding and opening either all MCA components, + * or the one that was specifically requested via an MCA parameter.
+ */ +int opal_compress_base_open(void) +{ + int ret, exit_status = OPAL_SUCCESS; + int value; + char *str_value = NULL; + + /* Debugging/Verbose output */ + mca_base_param_reg_int_name("compress", + "base_verbose", + "Verbosity level of the COMPRESS framework", + false, false, + 0, &value); + if(0 != value) { + opal_compress_base_output = opal_output_open(NULL); + } else { + opal_compress_base_output = -1; + } + opal_output_set_verbosity(opal_compress_base_output, value); + + /* + * Which COMPRESS component to open + * - NULL or "" = auto-select + * - "none" = Empty component + * - ow. select that specific component + */ + mca_base_param_reg_string_name("compress", NULL, + "Which COMPRESS component to use (empty = auto-select)", + false, false, + NULL, &str_value); + + /* Compression currently only used with C/R */ + if( !opal_cr_is_enabled ) { + opal_output_verbose(10, opal_compress_base_output, + "compress:open: FT is not enabled, skipping!"); + return OPAL_SUCCESS; + } + + /* Open up all available components */ + if (OPAL_SUCCESS != (ret = mca_base_components_open("compress", + opal_compress_base_output, + mca_compress_base_static_components, + &opal_compress_base_components_available, + true)) ) { + if( OPAL_ERR_NOT_FOUND == ret && + NULL != str_value && + 0 == strncmp(str_value, "none", strlen("none")) ) { + exit_status = OPAL_SUCCESS; + } else { + exit_status = OPAL_ERROR; + } + } + + if( NULL != str_value ) { + free(str_value); + } + return exit_status; +} diff --git a/opal/mca/compress/base/compress_base_select.c b/opal/mca/compress/base/compress_base_select.c new file mode 100644 index 0000000000..f9478ed898 --- /dev/null +++ b/opal/mca/compress/base/compress_base_select.c @@ -0,0 +1,65 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved. 
+ * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "opal_config.h" + +#ifdef HAVE_UNISTD_H +#include "unistd.h" +#endif + +#include "opal/include/opal/constants.h" +#include "opal/util/output.h" +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" + +int opal_compress_base_select(void) +{ + int ret, exit_status = OPAL_SUCCESS; + opal_compress_base_component_t *best_component = NULL; + opal_compress_base_module_t *best_module = NULL; + + /* Compression currently only used with C/R */ + if( !opal_cr_is_enabled ) { + opal_output_verbose(10, opal_compress_base_output, + "compress:select: FT is not enabled, skipping!"); + return OPAL_SUCCESS; + } + + /* + * Select the best component + */ + if( OPAL_SUCCESS != mca_base_select("compress", opal_compress_base_output, + &opal_compress_base_components_available, + (mca_base_module_t **) &best_module, + (mca_base_component_t **) &best_component) ) { + /* This will only happen if no component was selected */ + exit_status = OPAL_ERROR; + goto cleanup; + } + + /* Save the winner */ + opal_compress_base_selected_component = *best_component; + opal_compress = *best_module; + + /* Initialize the winner */ + if (NULL != best_module) { + if (OPAL_SUCCESS != (ret = opal_compress.init()) ) { + exit_status = ret; + goto cleanup; + } + } + + cleanup: + return exit_status; +} diff --git a/opal/mca/compress/base/help-opal-compress-base.txt b/opal/mca/compress/base/help-opal-compress-base.txt new file mode 100644 index 0000000000..fb1f11273d --- /dev/null +++ b/opal/mca/compress/base/help-opal-compress-base.txt @@ -0,0 +1,13 @@ + -*- text -*- +# +# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved.
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# +# This is the US/English general help file for Open PAL Compress framework. +# diff --git a/opal/mca/compress/bzip/Makefile.am b/opal/mca/compress/bzip/Makefile.am new file mode 100644 index 0000000000..e0fc5151d9 --- /dev/null +++ b/opal/mca/compress/bzip/Makefile.am @@ -0,0 +1,40 @@ +# +# Copyright (c) 2004-2010 The Trustees of Indiana University. +# All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +AM_CPPFLAGS = \ + $(LTDLINCL) + +dist_pkgdata_DATA = help-opal-compress-bzip.txt + +sources = \ + compress_bzip.h \ + compress_bzip_component.c \ + compress_bzip_module.c + +# Make the output library in this directory, and name it either +# mca__.la (for DSO builds) or libmca__.la +# (for static builds). + +if OMPI_BUILD_compress_bzip_DSO +component_noinst = +component_install = mca_compress_bzip.la +else +component_noinst = libmca_compress_bzip.la +component_install = +endif + +mcacomponentdir = $(pkglibdir) +mcacomponent_LTLIBRARIES = $(component_install) +mca_compress_bzip_la_SOURCES = $(sources) +mca_compress_bzip_la_LDFLAGS = -module -avoid-version + +noinst_LTLIBRARIES = $(component_noinst) +libmca_compress_bzip_la_SOURCES = $(sources) +libmca_compress_bzip_la_LDFLAGS = -module -avoid-version diff --git a/opal/mca/compress/bzip/compress_bzip.h b/opal/mca/compress/bzip/compress_bzip.h new file mode 100644 index 0000000000..448430c263 --- /dev/null +++ b/opal/mca/compress/bzip/compress_bzip.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/** + * @file + * + * BZIP COMPRESS component + * + * Uses the bzip library + */ + +#ifndef MCA_COMPRESS_BZIP_EXPORT_H +#define MCA_COMPRESS_BZIP_EXPORT_H + +#include "opal_config.h" + +#include "opal/util/output.h" + +#include "opal/mca/mca.h" +#include "opal/mca/compress/compress.h" + +#if defined(c_plusplus) || defined(__cplusplus) +extern "C" { +#endif + + /* + * Local Component structures + */ + struct opal_compress_bzip_component_t { + opal_compress_base_component_t super; /** Base COMPRESS component */ + + }; + typedef struct opal_compress_bzip_component_t opal_compress_bzip_component_t; + OPAL_MODULE_DECLSPEC extern opal_compress_bzip_component_t mca_compress_bzip_component; + + int opal_compress_bzip_component_query(mca_base_module_t **module, int *priority); + + /* + * Module functions + */ + int opal_compress_bzip_module_init(void); + int opal_compress_bzip_module_finalize(void); + + /* + * Actual functionality + */ + int opal_compress_bzip_compress(char *fname, char **cname, char **postfix); + int opal_compress_bzip_compress_nb(char *fname, char **cname, char **postfix, pid_t *child_pid); + int opal_compress_bzip_decompress(char *cname, char **fname); + int opal_compress_bzip_decompress_nb(char *cname, char **fname, pid_t *child_pid); + +#if defined(c_plusplus) || defined(__cplusplus) +} +#endif + +#endif /* MCA_COMPRESS_BZIP_EXPORT_H */ diff --git a/opal/mca/compress/bzip/compress_bzip_component.c b/opal/mca/compress/bzip/compress_bzip_component.c new file mode 100644 index 0000000000..d301cbbe40 --- /dev/null +++ b/opal/mca/compress/bzip/compress_bzip_component.c @@ -0,0 +1,138 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved.
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "opal_config.h" + +#include "opal/constants.h" +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" +#include "compress_bzip.h" + +/* + * Public string for version number + */ +const char *opal_compress_bzip_component_version_string = +"OPAL COMPRESS bzip MCA component version " OPAL_VERSION; + +/* + * Local functionality + */ +static int compress_bzip_open(void); +static int compress_bzip_close(void); + +/* + * Instantiate the public struct with all of our public information + * and pointer to our public functions in it + */ +opal_compress_bzip_component_t mca_compress_bzip_component = { + /* First do the base component stuff */ + { + /* Handle the general mca_component_t struct containing + * meta information about the component itbzip + */ + { + OPAL_COMPRESS_BASE_VERSION_2_0_0, + + /* Component name and version */ + "bzip", + OPAL_MAJOR_VERSION, + OPAL_MINOR_VERSION, + OPAL_RELEASE_VERSION, + + /* Component open and close functions */ + compress_bzip_open, + compress_bzip_close, + opal_compress_bzip_component_query + }, + { + /* The component is checkpoint ready */ + MCA_BASE_METADATA_PARAM_CHECKPOINT + }, + + /* Verbosity level */ + 0, + /* opal_output handler */ + -1, + /* Default priority */ + 10 + } +}; + +/* + * Bzip module + */ +static opal_compress_base_module_t loc_module = { + /** Initialization Function */ + opal_compress_bzip_module_init, + /** Finalization Function */ + opal_compress_bzip_module_finalize, + + /** Compress Function */ + opal_compress_bzip_compress, + opal_compress_bzip_compress_nb, + + /** Decompress Function */ + opal_compress_bzip_decompress, + opal_compress_bzip_decompress_nb +}; + +static int compress_bzip_open(void) +{ + mca_base_param_reg_int(&mca_compress_bzip_component.super.base_version, + "priority", + "Priority of the COMPRESS bzip component", + false, false, + mca_compress_bzip_component.super.priority, 
+ &mca_compress_bzip_component.super.priority); + + mca_base_param_reg_int(&mca_compress_bzip_component.super.base_version, + "verbose", + "Verbose level for the COMPRESS bzip component", + false, false, + mca_compress_bzip_component.super.verbose, + &mca_compress_bzip_component.super.verbose); + /* If there is a custom verbose level for this component then use it + * otherwise take our parent's level and output channel + */ + if ( 0 != mca_compress_bzip_component.super.verbose) { + mca_compress_bzip_component.super.output_handle = opal_output_open(NULL); + opal_output_set_verbosity(mca_compress_bzip_component.super.output_handle, + mca_compress_bzip_component.super.verbose); + } else { + mca_compress_bzip_component.super.output_handle = opal_compress_base_output; + } + + /* + * Debug output + */ + opal_output_verbose(10, mca_compress_bzip_component.super.output_handle, + "compress:bzip: open()"); + opal_output_verbose(20, mca_compress_bzip_component.super.output_handle, + "compress:bzip: open: priority = %d", + mca_compress_bzip_component.super.priority); + opal_output_verbose(20, mca_compress_bzip_component.super.output_handle, + "compress:bzip: open: verbosity = %d", + mca_compress_bzip_component.super.verbose); + return OPAL_SUCCESS; +} + +static int compress_bzip_close(void) +{ + return OPAL_SUCCESS; +} + +int opal_compress_bzip_component_query(mca_base_module_t **module, int *priority) +{ + *module = (mca_base_module_t *)&loc_module; + *priority = mca_compress_bzip_component.super.priority; + + return OPAL_SUCCESS; +} + diff --git a/opal/mca/compress/bzip/compress_bzip_module.c b/opal/mca/compress/bzip/compress_bzip_module.c new file mode 100644 index 0000000000..71f09275d5 --- /dev/null +++ b/opal/mca/compress/bzip/compress_bzip_module.c @@ -0,0 +1,247 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved.
+ * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "opal_config.h" + +#include +#include +#include +#include +#if HAVE_UNISTD_H +#include +#endif /* HAVE_UNISTD_H */ + +#include "opal/util/opal_environ.h" +#include "opal/util/output.h" +#include "opal/util/show_help.h" +#include "opal/util/argv.h" +#include "opal/util/opal_environ.h" + +#include "opal/constants.h" +#include "opal/mca/base/mca_base_param.h" +#include "opal/util/basename.h" + +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" +#include "opal/runtime/opal_cr.h" + +#include "compress_bzip.h" + +static bool is_directory(char *fname ); + +int opal_compress_bzip_module_init(void) +{ + return OPAL_SUCCESS; +} + +int opal_compress_bzip_module_finalize(void) +{ + return OPAL_SUCCESS; +} + +int opal_compress_bzip_compress(char * fname, char **cname, char **postfix) +{ + int child_pid = 0; + int status = 0; + + opal_output_verbose(10, mca_compress_bzip_component.super.output_handle, + "compress:bzip: compress(%s)", + fname); + + opal_compress_bzip_compress_nb(fname, cname, postfix, &child_pid); + waitpid(child_pid, &status, 0); + + if( WIFEXITED(status) ) { + return OPAL_SUCCESS; + } else { + return OPAL_ERROR; + } +} + +int opal_compress_bzip_compress_nb(char * fname, char **cname, char **postfix, pid_t *child_pid) +{ + char * cmd = NULL; + char **argv = NULL; + char * base_fname = NULL; + char * dir_fname = NULL; + int status; + bool is_dir; + + is_dir = is_directory(fname); + + *child_pid = fork(); + if( *child_pid == 0 ) { /* Child */ + + dir_fname = opal_dirname(fname); + base_fname = opal_basename(fname); + + chdir(dir_fname); + + if( is_dir ) { +#if 0 + opal_compress_base_tar_create(&base_fname); + asprintf(cname, "%s.bz2", base_fname); + asprintf(&cmd, "bzip2 %s", base_fname); +#else + asprintf(cname, "%s.tar.bz2", base_fname); + asprintf(&cmd, "tar -jcf %s %s", *cname, base_fname); +#endif + } else { + asprintf(cname, 
"%s.bz2", base_fname); + asprintf(&cmd, "bzip2 %s", base_fname); + } + + opal_output_verbose(10, mca_compress_bzip_component.super.output_handle, + "compress:bzip: compress_nb(%s -> [%s])", + fname, *cname); + opal_output_verbose(10, mca_compress_bzip_component.super.output_handle, + "compress:bzip: compress_nb() command [%s]", + cmd); + + argv = opal_argv_split(cmd, ' '); + status = execvp(argv[0], argv); + + opal_output(0, "compress:bzip: compress_nb: Failed to exec child [%s] status = %d\n", cmd, status); + exit(OPAL_ERROR); + } + else if( *child_pid > 0 ) { + if( is_dir ) { + *postfix = strdup(".tar.bz2"); + } else { + *postfix = strdup(".bz2"); + } + asprintf(cname, "%s%s", fname, *postfix); + } + else { + return OPAL_ERROR; + } + + if( NULL != cmd ) { + free(cmd); + cmd = NULL; + } + + return OPAL_SUCCESS; +} + +int opal_compress_bzip_decompress(char * cname, char **fname) +{ + int child_pid = 0; + int status = 0; + + opal_output_verbose(10, mca_compress_bzip_component.super.output_handle, + "compress:bzip: decompress(%s)", + cname); + + opal_compress_bzip_decompress_nb(cname, fname, &child_pid); + waitpid(child_pid, &status, 0); + + if( WIFEXITED(status) ) { + return OPAL_SUCCESS; + } else { + return OPAL_ERROR; + } +} + +int opal_compress_bzip_decompress_nb(char * cname, char **fname, pid_t *child_pid) +{ + char * cmd = NULL; + char **argv = NULL; + char * dir_cname = NULL; + pid_t loc_pid = 0; + int status; + bool is_tar; + + if( 0 == strncmp(&(cname[strlen(cname)-8]), ".tar.bz2", strlen(".tar.bz2")) ) { + is_tar = true; + } + + *fname = strdup(cname); + if( is_tar ) { + (*fname)[strlen(cname)-8] = '\0'; + } else { + (*fname)[strlen(cname)-4] = '\0'; + } + + opal_output_verbose(10, mca_compress_bzip_component.super.output_handle, + "compress:bzip: decompress_nb(%s -> [%s])", + cname, *fname); + + *child_pid = fork(); + if( *child_pid == 0 ) { /* Child */ + dir_cname = opal_dirname(cname); + + chdir(dir_cname); + + /* Fork(bunzip) */ + loc_pid = fork(); + 
if( loc_pid == 0 ) { /* Child */ + asprintf(&cmd, "bunzip2 %s", cname); + + opal_output_verbose(10, mca_compress_bzip_component.super.output_handle, + "compress:bzip: decompress_nb() command [%s]", + cmd); + + argv = opal_argv_split(cmd, ' '); + status = execvp(argv[0], argv); + + opal_output(0, "compress:bzip: decompress_nb: Failed to exec child [%s] status = %d\n", cmd, status); + exit(OPAL_ERROR); + } + else if( loc_pid > 0 ) { /* Parent */ + waitpid(loc_pid, &status, 0); + if( !WIFEXITED(status) ) { + opal_output(0, "compress:bzip: decompress_nb: Failed to bunzip the file [%s] status = %d\n", cname, status); + exit(OPAL_ERROR); + } + } + else { + exit(OPAL_ERROR); + } + + /* tar_decompress */ + if( is_tar ) { + /* Strip off '.bz2' leaving just '.tar' */ + cname[strlen(cname)-4] = '\0'; + opal_compress_base_tar_extract(&cname); + } + + /* Once this child is done, then directly exit */ + exit(OPAL_SUCCESS); + } + else if( *child_pid > 0 ) { + ; + } + else { + return OPAL_ERROR; + } + + if( NULL != cmd ) { + free(cmd); + cmd = NULL; + } + + return OPAL_SUCCESS; +} + +static bool is_directory(char *fname ) { + struct stat file_status; + int rc; + + if(0 != (rc = stat(fname, &file_status) ) ) { + return false; + } + if(S_ISDIR(file_status.st_mode)) { + return true; + } + + return false; +} diff --git a/opal/mca/compress/bzip/configure.params b/opal/mca/compress/bzip/configure.params new file mode 100644 index 0000000000..d10e0f72b5 --- /dev/null +++ b/opal/mca/compress/bzip/configure.params @@ -0,0 +1,13 @@ +# -*- shell-script -*- +# +# Copyright (c) 2004-2010 The Trustees of Indiana University. +# All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +PARAM_INIT_FILE=compress_bzip_component.c +PARAM_CONFIG_FILES="Makefile" diff --git a/opal/mca/compress/bzip/help-opal-compress-bzip.txt b/opal/mca/compress/bzip/help-opal-compress-bzip.txt new file mode 100644 index 0000000000..fb1f11273d --- /dev/null +++ b/opal/mca/compress/bzip/help-opal-compress-bzip.txt @@ -0,0 +1,13 @@ + -*- text -*- +# +# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# +# This is the US/English general help file for Open PAL Compress framework. +# diff --git a/opal/mca/compress/compress.h b/opal/mca/compress/compress.h new file mode 100644 index 0000000000..f377d04a21 --- /dev/null +++ b/opal/mca/compress/compress.h @@ -0,0 +1,135 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. + * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +/** + * @file + * + * Compression Framework + * + * General Description: + * + * The OPAL Compress framework has been created to provide an abstract interface + * to the compression agent library on the host machine. This framework is useful + * when distributing files that can be compressed before sending to diminish the + * load on the network. + * + */ + +#ifndef MCA_COMPRESS_H +#define MCA_COMPRESS_H + +#include "opal_config.h" +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/class/opal_object.h" + +#if defined(c_plusplus) || defined(__cplusplus) +extern "C" { +#endif + +/** + * Module initialization function. + * Returns OPAL_SUCCESS + */ +typedef int (*opal_compress_base_module_init_fn_t) + (void); + +/** + * Module finalization function.
+ * Returns OPAL_SUCCESS + */ +typedef int (*opal_compress_base_module_finalize_fn_t) + (void); + +/** + * Compress the file provided + * + * Arguments: + * fname = Filename to compress + * cname = Compressed filename + * postfix = postfix added to filename to create compressed filename + * Returns: + * OPAL_SUCCESS on success, otherwise OPAL_ERROR + */ +typedef int (*opal_compress_base_module_compress_fn_t) + (char * fname, char **cname, char **postfix); + +typedef int (*opal_compress_base_module_compress_nb_fn_t) + (char * fname, char **cname, char **postfix, pid_t *child_pid); + +/** + * Decompress the file provided + * + * Arguments: + * cname = Compressed filename to decompress + * fname = Decompressed filename + * Returns: + * OPAL_SUCCESS on success, otherwise OPAL_ERROR + */ +typedef int (*opal_compress_base_module_decompress_fn_t) + (char * cname, char **fname); +typedef int (*opal_compress_base_module_decompress_nb_fn_t) + (char * cname, char **fname, pid_t *child_pid); + +/** + * Structure for COMPRESS components.
+ */ +struct opal_compress_base_component_2_0_0_t { + /** MCA base component */ + mca_base_component_t base_version; + /** MCA base data */ + mca_base_component_data_t base_data; + + /** Verbosity Level */ + int verbose; + /** Output Handle for opal_output */ + int output_handle; + /** Default Priority */ + int priority; +}; +typedef struct opal_compress_base_component_2_0_0_t opal_compress_base_component_2_0_0_t; +typedef struct opal_compress_base_component_2_0_0_t opal_compress_base_component_t; + +/** + * Structure for COMPRESS modules + */ +struct opal_compress_base_module_1_0_0_t { + /** Initialization Function */ + opal_compress_base_module_init_fn_t init; + /** Finalization Function */ + opal_compress_base_module_finalize_fn_t finalize; + + /** Compress interface */ + opal_compress_base_module_compress_fn_t compress; + opal_compress_base_module_compress_nb_fn_t compress_nb; + + /** Decompress Interface */ + opal_compress_base_module_decompress_fn_t decompress; + opal_compress_base_module_decompress_nb_fn_t decompress_nb; +}; +typedef struct opal_compress_base_module_1_0_0_t opal_compress_base_module_1_0_0_t; +typedef struct opal_compress_base_module_1_0_0_t opal_compress_base_module_t; + +OPAL_DECLSPEC extern opal_compress_base_module_t opal_compress; + +/** + * Macro for use in components that are of type COMPRESS + */ +#define OPAL_COMPRESS_BASE_VERSION_2_0_0 \ + MCA_BASE_VERSION_2_0_0, \ + "compress", 2, 0, 0 + +#if defined(c_plusplus) || defined(__cplusplus) +} +#endif + +#endif /* OPAL_COMPRESS_H */ + diff --git a/opal/mca/compress/gzip/Makefile.am b/opal/mca/compress/gzip/Makefile.am new file mode 100644 index 0000000000..0f26c0151b --- /dev/null +++ b/opal/mca/compress/gzip/Makefile.am @@ -0,0 +1,40 @@ +# +# Copyright (c) 2004-2010 The Trustees of Indiana University. +# All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +AM_CPPFLAGS = \ + $(LTDLINCL) + +dist_pkgdata_DATA = help-opal-compress-gzip.txt + +sources = \ + compress_gzip.h \ + compress_gzip_component.c \ + compress_gzip_module.c + +# Make the output library in this directory, and name it either +# mca__.la (for DSO builds) or libmca__.la +# (for static builds). + +if OMPI_BUILD_compress_gzip_DSO +component_noinst = +component_install = mca_compress_gzip.la +else +component_noinst = libmca_compress_gzip.la +component_install = +endif + +mcacomponentdir = $(pkglibdir) +mcacomponent_LTLIBRARIES = $(component_install) +mca_compress_gzip_la_SOURCES = $(sources) +mca_compress_gzip_la_LDFLAGS = -module -avoid-version + +noinst_LTLIBRARIES = $(component_noinst) +libmca_compress_gzip_la_SOURCES = $(sources) +libmca_compress_gzip_la_LDFLAGS = -module -avoid-version diff --git a/opal/mca/compress/gzip/compress_gzip.h b/opal/mca/compress/gzip/compress_gzip.h new file mode 100644 index 0000000000..d470cbae2f --- /dev/null +++ b/opal/mca/compress/gzip/compress_gzip.h @@ -0,0 +1,63 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/** + * @file + * + * GZIP COMPRESS component + * + * Uses the gzip library + */ + +#ifndef MCA_COMPRESS_GZIP_EXPORT_H +#define MCA_COMPRESS_GZIP_EXPORT_H + +#include "opal_config.h" + +#include "opal/util/output.h" + +#include "opal/mca/mca.h" +#include "opal/mca/compress/compress.h" + +#if defined(c_plusplus) || defined(__cplusplus) +extern "C" { +#endif + + /* + * Local Component structures + */ + struct opal_compress_gzip_component_t { + opal_compress_base_component_t super; /** Base COMPRESS component */ + + }; + typedef struct opal_compress_gzip_component_t opal_compress_gzip_component_t; + OPAL_MODULE_DECLSPEC extern opal_compress_gzip_component_t mca_compress_gzip_component; + + int opal_compress_gzip_component_query(mca_base_module_t **module, int *priority); + + /* + * Module functions + */ + int opal_compress_gzip_module_init(void); + int opal_compress_gzip_module_finalize(void); + + /* + * Actual functionality + */ + int opal_compress_gzip_compress(char *fname, char **cname, char **postfix); + int opal_compress_gzip_compress_nb(char *fname, char **cname, char **postfix, pid_t *child_pid); + int opal_compress_gzip_decompress(char *cname, char **fname); + int opal_compress_gzip_decompress_nb(char *cname, char **fname, pid_t *child_pid); + +#if defined(c_plusplus) || defined(__cplusplus) +} +#endif + +#endif /* MCA_COMPRESS_GZIP_EXPORT_H */ diff --git a/opal/mca/compress/gzip/compress_gzip_component.c b/opal/mca/compress/gzip/compress_gzip_component.c new file mode 100644 index 0000000000..685f411ea3 --- /dev/null +++ b/opal/mca/compress/gzip/compress_gzip_component.c @@ -0,0 +1,138 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved.
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "opal_config.h" + +#include "opal/constants.h" +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" +#include "compress_gzip.h" + +/* + * Public string for version number + */ +const char *opal_compress_gzip_component_version_string = +"OPAL COMPRESS gzip MCA component version " OPAL_VERSION; + +/* + * Local functionality + */ +static int compress_gzip_open(void); +static int compress_gzip_close(void); + +/* + * Instantiate the public struct with all of our public information + * and pointer to our public functions in it + */ +opal_compress_gzip_component_t mca_compress_gzip_component = { + /* First do the base component stuff */ + { + /* Handle the general mca_component_t struct containing + * meta information about the component itgzip + */ + { + OPAL_COMPRESS_BASE_VERSION_2_0_0, + + /* Component name and version */ + "gzip", + OPAL_MAJOR_VERSION, + OPAL_MINOR_VERSION, + OPAL_RELEASE_VERSION, + + /* Component open and close functions */ + compress_gzip_open, + compress_gzip_close, + opal_compress_gzip_component_query + }, + { + /* The component is checkpoint ready */ + MCA_BASE_METADATA_PARAM_CHECKPOINT + }, + + /* Verbosity level */ + 0, + /* opal_output handler */ + -1, + /* Default priority */ + 15 + } +}; + +/* + * Gzip module + */ +static opal_compress_base_module_t loc_module = { + /** Initialization Function */ + opal_compress_gzip_module_init, + /** Finalization Function */ + opal_compress_gzip_module_finalize, + + /** Compress Function */ + opal_compress_gzip_compress, + opal_compress_gzip_compress_nb, + + /** Decompress Function */ + opal_compress_gzip_decompress, + opal_compress_gzip_decompress_nb +}; + +static int compress_gzip_open(void) +{ + mca_base_param_reg_int(&mca_compress_gzip_component.super.base_version, + "priority", + "Priority of the COMPRESS gzip component", + false, false, + mca_compress_gzip_component.super.priority, 
+ &mca_compress_gzip_component.super.priority); + + mca_base_param_reg_int(&mca_compress_gzip_component.super.base_version, + "verbose", + "Verbose level for the COMPRESS gzip component", + false, false, + mca_compress_gzip_component.super.verbose, + &mca_compress_gzip_component.super.verbose); + /* If there is a custom verbose level for this component then use it + * otherwise take our parent's level and output channel + */ + if ( 0 != mca_compress_gzip_component.super.verbose) { + mca_compress_gzip_component.super.output_handle = opal_output_open(NULL); + opal_output_set_verbosity(mca_compress_gzip_component.super.output_handle, + mca_compress_gzip_component.super.verbose); + } else { + mca_compress_gzip_component.super.output_handle = opal_compress_base_output; + } + + /* + * Debug output + */ + opal_output_verbose(10, mca_compress_gzip_component.super.output_handle, + "compress:gzip: open()"); + opal_output_verbose(20, mca_compress_gzip_component.super.output_handle, + "compress:gzip: open: priority = %d", + mca_compress_gzip_component.super.priority); + opal_output_verbose(20, mca_compress_gzip_component.super.output_handle, + "compress:gzip: open: verbosity = %d", + mca_compress_gzip_component.super.verbose); + return OPAL_SUCCESS; +} + +static int compress_gzip_close(void) +{ + return OPAL_SUCCESS; +} + +int opal_compress_gzip_component_query(mca_base_module_t **module, int *priority) +{ + *module = (mca_base_module_t *)&loc_module; + *priority = mca_compress_gzip_component.super.priority; + + return OPAL_SUCCESS; +} + diff --git a/opal/mca/compress/gzip/compress_gzip_module.c b/opal/mca/compress/gzip/compress_gzip_module.c new file mode 100644 index 0000000000..86f63aff26 --- /dev/null +++ b/opal/mca/compress/gzip/compress_gzip_module.c @@ -0,0 +1,250 @@ +/* + * Copyright (c) 2004-2010 The Trustees of Indiana University. + * All rights reserved.
+ * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "opal_config.h" + +#include +#include +#include +#include +#if HAVE_UNISTD_H +#include +#endif /* HAVE_UNISTD_H */ + +#include "opal/util/opal_environ.h" +#include "opal/util/output.h" +#include "opal/util/show_help.h" +#include "opal/util/argv.h" +#include "opal/util/opal_environ.h" + +#include "opal/constants.h" +#include "opal/mca/base/mca_base_param.h" +#include "opal/util/basename.h" + +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" +#include "opal/runtime/opal_cr.h" + +#include "compress_gzip.h" + +static bool is_directory(char *fname ); + +int opal_compress_gzip_module_init(void) +{ + return OPAL_SUCCESS; +} + +int opal_compress_gzip_module_finalize(void) +{ + return OPAL_SUCCESS; +} + +int opal_compress_gzip_compress(char * fname, char **cname, char **postfix) +{ + int child_pid = 0; + int status = 0; + + opal_output_verbose(10, mca_compress_gzip_component.super.output_handle, + "compress:gzip: compress(%s)", + fname); + + opal_compress_gzip_compress_nb(fname, cname, postfix, &child_pid); + waitpid(child_pid, &status, 0); + + if( WIFEXITED(status) ) { + return OPAL_SUCCESS; + } else { + return OPAL_ERROR; + } +} + +int opal_compress_gzip_compress_nb(char * fname, char **cname, char **postfix, pid_t *child_pid) +{ + char * cmd = NULL; + char **argv = NULL; + char * base_fname = NULL; + char * dir_fname = NULL; + int status; + bool is_dir; + + is_dir = is_directory(fname); + + *child_pid = fork(); + if( *child_pid == 0 ) { /* Child */ + + dir_fname = opal_dirname(fname); + base_fname = opal_basename(fname); + + chdir(dir_fname); + + if( is_dir ) { +#if 0 + opal_compress_base_tar_create(&base_fname); + asprintf(cname, "%s.gz", base_fname); + asprintf(&cmd, "gzip %s", base_fname); +#else + asprintf(cname, "%s.tar.gz", base_fname); + asprintf(&cmd, "tar -zcf %s %s", *cname, base_fname); +#endif + } else { + asprintf(cname, "%s.gz", 
base_fname); + asprintf(&cmd, "gzip %s", base_fname); + } + + opal_output_verbose(10, mca_compress_gzip_component.super.output_handle, + "compress:gzip: compress_nb(%s -> [%s])", + fname, *cname); + opal_output_verbose(10, mca_compress_gzip_component.super.output_handle, + "compress:gzip: compress_nb() command [%s]", + cmd); + + argv = opal_argv_split(cmd, ' '); + status = execvp(argv[0], argv); + + opal_output(0, "compress:gzip: compress_nb: Failed to exec child [%s] status = %d\n", cmd, status); + exit(OPAL_ERROR); + } + else if( *child_pid > 0 ) { + if( is_dir ) { + *postfix = strdup(".tar.gz"); + } else { + *postfix = strdup(".gz"); + } + asprintf(cname, "%s%s", fname, *postfix); + + } + else { + return OPAL_ERROR; + } + + if( NULL != cmd ) { + free(cmd); + cmd = NULL; + } + + return OPAL_SUCCESS; +} + +int opal_compress_gzip_decompress(char * cname, char **fname) +{ + int child_pid = 0; + int status = 0; + + opal_output_verbose(10, mca_compress_gzip_component.super.output_handle, + "compress:gzip: decompress(%s)", + cname); + + opal_compress_gzip_decompress_nb(cname, fname, &child_pid); + waitpid(child_pid, &status, 0); + + if( WIFEXITED(status) ) { + return OPAL_SUCCESS; + } else { + return OPAL_ERROR; + } +} + +int opal_compress_gzip_decompress_nb(char * cname, char **fname, pid_t *child_pid) +{ + char * cmd = NULL; + char **argv = NULL; + char * dir_cname = NULL; + pid_t loc_pid = 0; + int status; + bool is_tar = false; + + if( 0 == strncmp(&(cname[strlen(cname)-7]), ".tar.gz", strlen(".tar.gz")) ) { + is_tar = true; + } + + *fname = strdup(cname); + if( is_tar ) { + /* Strip off '.tar.gz' */ + (*fname)[strlen(cname)-7] = '\0'; + } else { + /* Strip off '.gz' */ + (*fname)[strlen(cname)-3] = '\0'; + } + + opal_output_verbose(10, mca_compress_gzip_component.super.output_handle, + "compress:gzip: decompress_nb(%s -> [%s])", + cname, *fname); + + *child_pid = fork(); + if( *child_pid == 0 ) { /* Child */ + dir_cname = opal_dirname(cname); + + chdir(dir_cname); + + /*
Fork(gunzip) */ + loc_pid = fork(); + if( loc_pid == 0 ) { /* Child */ + asprintf(&cmd, "gunzip %s", cname); + + opal_output_verbose(10, mca_compress_gzip_component.super.output_handle, + "compress:gzip: decompress_nb() command [%s]", + cmd); + + argv = opal_argv_split(cmd, ' '); + status = execvp(argv[0], argv); + + opal_output(0, "compress:gzip: decompress_nb: Failed to exec child [%s] status = %d\n", cmd, status); + exit(OPAL_ERROR); + } + else if( loc_pid > 0 ) { /* Parent */ + waitpid(loc_pid, &status, 0); + if( !WIFEXITED(status) ) { + opal_output(0, "compress:gzip: decompress_nb: Failed to gunzip the file [%s] status = %d\n", cname, status); + exit(OPAL_ERROR); + } + } + else { + exit(OPAL_ERROR); + } + + /* tar_decompress */ + if( is_tar ) { + /* Strip off '.gz' leaving just '.tar' */ + cname[strlen(cname)-3] = '\0'; + opal_compress_base_tar_extract(&cname); + } + + /* Once this child is done, then directly exit */ + exit(OPAL_SUCCESS); + } + else if( *child_pid > 0 ) { + ; + } + else { + return OPAL_ERROR; + } + + if( NULL != cmd ) { + free(cmd); + cmd = NULL; + } + + return OPAL_SUCCESS; +} + +static bool is_directory(char *fname ) { + struct stat file_status; + int rc; + + if(0 != (rc = stat(fname, &file_status) ) ) { + return false; + } + if(S_ISDIR(file_status.st_mode)) { + return true; + } + + return false; +} diff --git a/opal/mca/compress/gzip/configure.params b/opal/mca/compress/gzip/configure.params new file mode 100644 index 0000000000..d3cfd4ae7a --- /dev/null +++ b/opal/mca/compress/gzip/configure.params @@ -0,0 +1,13 @@ +# -*- shell-script -*- +# +# Copyright (c) 2004-2010 The Trustees of Indiana University. +# All rights reserved.
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +PARAM_INIT_FILE=compress_gzip_component.c +PARAM_CONFIG_FILES="Makefile" diff --git a/opal/mca/compress/gzip/help-opal-compress-gzip.txt b/opal/mca/compress/gzip/help-opal-compress-gzip.txt new file mode 100644 index 0000000000..fb1f11273d --- /dev/null +++ b/opal/mca/compress/gzip/help-opal-compress-gzip.txt @@ -0,0 +1,13 @@ + -*- text -*- +# +# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# +# This is the US/English general help file for Open PAL Compress framework. +# diff --git a/opal/mca/crs/base/base.h b/opal/mca/crs/base/base.h index 6aeabec261..f89fc97182 100644 --- a/opal/mca/crs/base/base.h +++ b/opal/mca/crs/base/base.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. 
* Copyright (c) 2004-2005 The University of Tennessee and The University @@ -23,6 +23,7 @@ #include "opal_config.h" #include "opal/mca/crs/crs.h" #include "opal/util/opal_environ.h" +#include "opal/runtime/opal_cr.h" /* * Global functions for MCA overall CRS @@ -32,7 +33,7 @@ BEGIN_C_DECLS /* Some local strings to use genericly with the local metadata file */ #define CRS_METADATA_BASE ("# ") -#define CRS_METADATA_COMP ("# Component: ") +#define CRS_METADATA_COMP ("# OPAL CRS Component: ") #define CRS_METADATA_PID ("# PID: ") #define CRS_METADATA_CONTEXT ("# CONTEXT: ") #define CRS_METADATA_MKDIR ("# MKDIR: ") @@ -71,35 +72,25 @@ BEGIN_C_DECLS /** * Globals */ -#define opal_crs_base_metadata_filename (strdup("snapshot_meta.data")) - OPAL_DECLSPEC extern int opal_crs_base_output; OPAL_DECLSPEC extern opal_list_t opal_crs_base_components_available; OPAL_DECLSPEC extern opal_crs_base_component_t opal_crs_base_selected_component; OPAL_DECLSPEC extern opal_crs_base_module_t opal_crs; - OPAL_DECLSPEC extern char * opal_crs_base_snapshot_dir; /** * Some utility functions */ OPAL_DECLSPEC char * opal_crs_base_state_str(opal_crs_state_type_t state); - OPAL_DECLSPEC char * opal_crs_base_unique_snapshot_name(pid_t pid); - OPAL_DECLSPEC int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** component_name, int *prev_pid); - OPAL_DECLSPEC int opal_crs_base_init_snapshot_directory(opal_crs_base_snapshot_t *snapshot); - OPAL_DECLSPEC char * opal_crs_base_get_snapshot_directory(char *uniq_snapshot_name); + /* + * Extract the expected component and pid from the metadata + */ + OPAL_DECLSPEC int opal_crs_base_extract_expected_component(FILE *metadata, char ** component_name, int *prev_pid); /* * Read a token to the metadata file - * NULL can be passed for snapshot_loc if nit_snapshot_directory has been called. 
*/ - OPAL_DECLSPEC int opal_crs_base_metadata_read_token(char *snapshot_loc, char * token, char ***value); - - /* - * Write a token to the metadata file - * NULL can be passed for snapshot_loc if nit_snapshot_directory has been called. - */ - OPAL_DECLSPEC int opal_crs_base_metadata_write_token(char *snapshot_loc, char * token, char *value); + OPAL_DECLSPEC int opal_crs_base_metadata_read_token(FILE *metadata, char * token, char ***value); /* * Register a file for cleanup. @@ -122,6 +113,24 @@ BEGIN_C_DECLS */ OPAL_DECLSPEC int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target); + /* + * CRS self application interface functions + */ + typedef int (*opal_crs_base_self_checkpoint_fn_t)(char **restart_cmd); + typedef int (*opal_crs_base_self_restart_fn_t)(void); + typedef int (*opal_crs_base_self_continue_fn_t)(void); + + extern opal_crs_base_self_checkpoint_fn_t crs_base_self_checkpoint_fn; + extern opal_crs_base_self_restart_fn_t crs_base_self_restart_fn; + extern opal_crs_base_self_continue_fn_t crs_base_self_continue_fn; + + OPAL_DECLSPEC int opal_crs_base_self_register_checkpoint_callback + (opal_crs_base_self_checkpoint_fn_t function); + OPAL_DECLSPEC int opal_crs_base_self_register_restart_callback + (opal_crs_base_self_restart_fn_t function); + OPAL_DECLSPEC int opal_crs_base_self_register_continue_callback + (opal_crs_base_self_continue_fn_t function); + END_C_DECLS #endif /* OPAL_CRS_BASE_H */ diff --git a/opal/mca/crs/base/crs_base_close.c b/opal/mca/crs/base/crs_base_close.c index 85991e66e6..220827fc6f 100644 --- a/opal/mca/crs/base/crs_base_close.c +++ b/opal/mca/crs/base/crs_base_close.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2007 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. 
@@ -24,6 +24,12 @@ int opal_crs_base_close(void)
 {
+    if( !opal_cr_is_enabled ) {
+        opal_output_verbose(10, opal_crs_base_output,
+                            "crs:close: FT is not enabled, skipping!");
+        return OPAL_SUCCESS;
+    }
+
     /* Call the component's finalize routine */
     if( NULL != opal_crs.crs_finalize ) {
         opal_crs.crs_finalize();
diff --git a/opal/mca/crs/base/crs_base_fns.c b/opal/mca/crs/base/crs_base_fns.c
index 67e5ecc2c7..be4b7d6b4e 100644
--- a/opal/mca/crs/base/crs_base_fns.c
+++ b/opal/mca/crs/base/crs_base_fns.c
@@ -44,13 +44,15 @@
 #include "opal/mca/crs/crs.h"
 #include "opal/mca/crs/base/base.h"
+opal_crs_base_self_checkpoint_fn_t crs_base_self_checkpoint_fn;
+opal_crs_base_self_restart_fn_t crs_base_self_restart_fn;
+opal_crs_base_self_continue_fn_t crs_base_self_continue_fn;
+
 /******************
  * Local Functions
  ******************/
 static int metadata_extract_next_token(FILE *file, char **token, char **value);
-static int opal_crs_base_metadata_open(FILE ** meta_data, char * location, char * mode);
-static char *last_metadata_file = NULL;
 static char **cleanup_file_argv = NULL;
 static char **cleanup_dir_argv = NULL;
@@ -59,30 +61,30 @@ static char **cleanup_dir_argv = NULL;
  ******************/
 static void opal_crs_base_construct(opal_crs_base_snapshot_t *snapshot)
 {
-    snapshot->component_name = NULL;
-    snapshot->reference_name = opal_crs_base_unique_snapshot_name(getpid());
-    snapshot->local_location = opal_crs_base_get_snapshot_directory(snapshot->reference_name);
-    snapshot->remote_location = strdup(snapshot->local_location);
+    snapshot->component_name = NULL;
+
+    snapshot->metadata_filename = NULL;
+    snapshot->metadata = NULL;
+    snapshot->snapshot_directory = NULL;
+    snapshot->cold_start = false;
 }
 static void opal_crs_base_destruct( opal_crs_base_snapshot_t *snapshot)
 {
-    if(NULL != snapshot->reference_name) {
-        free(snapshot->reference_name);
-        snapshot->reference_name = NULL;
+    if(NULL != snapshot->metadata_filename ) {
+        free(snapshot->metadata_filename);
+        snapshot->metadata_filename = NULL;
     }
-    if(NULL != snapshot->local_location) {
-        free(snapshot->local_location);
-        snapshot->local_location = NULL;
+
+    if(NULL != snapshot->metadata) {
+        fclose(snapshot->metadata);
+        snapshot->metadata = NULL;
     }
-    if(NULL != snapshot->remote_location) {
-        free(snapshot->remote_location);
-        snapshot->remote_location = NULL;
-    }
-    if(NULL != snapshot->component_name) {
-        free(snapshot->component_name);
-        snapshot->component_name = NULL;
+
+    if(NULL != snapshot->snapshot_directory ) {
+        free(snapshot->snapshot_directory);
+        snapshot->snapshot_directory = NULL;
     }
 }
@@ -107,43 +109,29 @@ OBJ_CLASS_INSTANCE(opal_crs_base_ckpt_options_t,
 /*
  * Utility functions
  */
-char * opal_crs_base_unique_snapshot_name(pid_t pid)
-{
-    char * loc_str = NULL;
-
-    asprintf(&loc_str, "opal_snapshot_%d.ckpt", pid);
-
-    return loc_str;
-}
-
-int opal_crs_base_metadata_read_token(char *snapshot_loc, char * token, char ***value) {
-    int ret, exit_status = OPAL_SUCCESS;
-    FILE * meta_data = NULL;
+int opal_crs_base_metadata_read_token(FILE *metadata, char * token, char ***value) {
+    int exit_status = OPAL_SUCCESS;
     char * loc_token = NULL;
     char * loc_value = NULL;
     int argc = 0;
     /* Dummy check */
     if( NULL == token ) {
+        exit_status = OPAL_ERROR;
         goto cleanup;
     }
-
-    /*
-     * Open the metadata file
-     */
-    if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_open(&meta_data, snapshot_loc, "r")) ) {
-        opal_output(opal_crs_base_output,
-                    "opal:crs:base: opal_crs_base_metadata_read_token: Error: Unable to open the metadata file\n");
-        exit_status = ret;
+    if( NULL == metadata ) {
+        exit_status = OPAL_ERROR;
         goto cleanup;
     }
     /*
      * Extract each token and make the records
      */
+    rewind(metadata);
     do {
         /* Get next token */
-        if( OPAL_SUCCESS != metadata_extract_next_token(meta_data, &loc_token, &loc_value) ) {
+        if( OPAL_SUCCESS != metadata_extract_next_token(metadata, &loc_token, &loc_value) ) {
             break;
         }
@@ -151,54 +139,26 @@ int opal_crs_base_metadata_read_token(char *snapshot_loc, char * token, char ***
         if(0 == strncmp(token, loc_token, strlen(loc_token)) ) {
             opal_argv_append(&argc, value, loc_value);
         }
-    } while(0 == feof(meta_data) );
+    } while(0 == feof(metadata) );
  cleanup:
-    if(NULL != meta_data) {
-        fclose(meta_data);
-        meta_data = NULL;
-    }
-
+    rewind(metadata);
     return exit_status;
 }
-int opal_crs_base_metadata_write_token(char *snapshot_loc, char * token, char *value) {
-    int ret, exit_status = OPAL_SUCCESS;
-    FILE * meta_data = NULL;
-
-    /* Dummy check */
-    if( NULL == token || NULL == value) {
-        goto cleanup;
-    }
-
-    /*
-     * Open the metadata file
-     */
-    if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_open(&meta_data, snapshot_loc, "a")) ) {
-        opal_output(opal_crs_base_output,
-                    "opal:crs:base: opal_crs_base_metadata_write_token: Error: Unable to open the metadata file\n");
-        exit_status = ret;
-        goto cleanup;
-    }
-
-    fprintf(meta_data, "%s%s\n", token, value);
-
- cleanup:
-    if(NULL != meta_data) {
-        fclose(meta_data);
-        meta_data = NULL;
-    }
-
-    return exit_status;
-}
-
-int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** component_name, int *prev_pid)
+int opal_crs_base_extract_expected_component(FILE *metadata, char ** component_name, int *prev_pid)
 {
     int exit_status = OPAL_SUCCESS;
     char **pid_argv = NULL;
     char **name_argv = NULL;
-    opal_crs_base_metadata_read_token(snapshot_loc, CRS_METADATA_PID, &pid_argv);
+    /* Dummy check */
+    if( NULL == metadata ) {
+        exit_status = OPAL_ERROR;
+        goto cleanup;
+    }
+
+    opal_crs_base_metadata_read_token(metadata, CRS_METADATA_PID, &pid_argv);
     if( NULL != pid_argv && NULL != pid_argv[0] ) {
         *prev_pid = atoi(pid_argv[0]);
     } else {
@@ -207,7 +167,7 @@ int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** compone
         goto cleanup;
     }
-    opal_crs_base_metadata_read_token(snapshot_loc, CRS_METADATA_COMP, &name_argv);
+    opal_crs_base_metadata_read_token(metadata, CRS_METADATA_COMP, &name_argv);
     if( NULL != name_argv && NULL != name_argv[0] ) {
         *component_name = strdup(name_argv[0]);
     } else {
@@ -230,68 +190,6 @@ int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** compone
     return exit_status;
 }
-char * opal_crs_base_get_snapshot_directory(char *uniq_snapshot_name)
-{
-    char * dir_name = NULL;
-
-    asprintf(&dir_name, "%s/%s", opal_crs_base_snapshot_dir, uniq_snapshot_name);
-
-    return dir_name;
-}
-
-int opal_crs_base_init_snapshot_directory(opal_crs_base_snapshot_t *snapshot)
-{
-    int ret, exit_status = OPAL_SUCCESS;
-    mode_t my_mode = S_IRWXU;
-    char * pid_str = NULL;
-
-    /*
-     * Make the snapshot directory from the uniq_snapshot_name
-     */
-    if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(snapshot->local_location, my_mode)) ) {
-        opal_output(opal_crs_base_output,
-                    "opal:crs:base: init_snapshot_directory: Error: Unable to create directory (%s)\n",
-                    snapshot->local_location);
-        exit_status = ret;
-        goto cleanup;
-    }
-
-    /*
-     * Initialize the metadata file at the top of that directory.
-     * Add 'BASE' and 'PID'
-     */
-    if( NULL != last_metadata_file ) {
-        free(last_metadata_file);
-        last_metadata_file = NULL;
-    }
-    last_metadata_file = strdup(snapshot->local_location);
-
-    if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_BASE, "") ) ) {
-        opal_output(opal_crs_base_output,
-                    "opal:crs:base: init_snapshot_directory: Error: Unable to write BASE to the file (%s/%s)\n",
-                    snapshot->local_location, opal_crs_base_metadata_filename);
-        exit_status = ret;
-        goto cleanup;
-    }
-
-    asprintf(&pid_str, "%d", getpid());
-    if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_PID, pid_str) ) ) {
-        opal_output(opal_crs_base_output,
-                    "opal:crs:base: init_snapshot_directory: Error: Unable to write PID (%s) to the file (%s/%s)\n",
-                    pid_str, snapshot->local_location, opal_crs_base_metadata_filename);
-        exit_status = ret;
-        goto cleanup;
-    }
-
- cleanup:
-    if( NULL != pid_str) {
-        free(pid_str);
-        pid_str = NULL;
-    }
-
-    return OPAL_SUCCESS;
-}
-
 int opal_crs_base_cleanup_append(char* filename, bool is_dir)
 {
     if( NULL == filename ) {
@@ -399,6 +297,14 @@ int opal_crs_base_copy_options(opal_crs_base_ckpt_options_t *from,
     to->term = from->term;
     to->stop = from->stop;
+    to->inc_prep_only = from->inc_prep_only;
+    to->inc_recover_only = from->inc_recover_only;
+
+#if OPAL_ENABLE_CRDEBUG == 1
+    to->attach_debugger = from->attach_debugger;
+    to->detach_debugger = from->detach_debugger;
+#endif
+
     return OPAL_SUCCESS;
 }
@@ -413,6 +319,32 @@ int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target)
     target->term = false;
     target->stop = false;
+    target->inc_prep_only = false;
+    target->inc_recover_only = false;
+
+#if OPAL_ENABLE_CRDEBUG == 1
+    target->attach_debugger = false;
+    target->detach_debugger = false;
+#endif
+
+    return OPAL_SUCCESS;
+}
+
+int opal_crs_base_self_register_checkpoint_callback(opal_crs_base_self_checkpoint_fn_t function)
+{
+    crs_base_self_checkpoint_fn = function;
+    return OPAL_SUCCESS;
+}
+
+int opal_crs_base_self_register_restart_callback(opal_crs_base_self_restart_fn_t function)
+{
+    crs_base_self_restart_fn = function;
+    return OPAL_SUCCESS;
+}
+
+int opal_crs_base_self_register_continue_callback(opal_crs_base_self_continue_fn_t function)
+{
+    crs_base_self_continue_fn = function;
     return OPAL_SUCCESS;
 }
@@ -420,38 +352,6 @@ int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target)
 /******************
  * Local Functions
  ******************/
-static int opal_crs_base_metadata_open(FILE **meta_data, char * location, char * mode)
-{
-    int exit_status = OPAL_SUCCESS;
-    char * dir_name = NULL;
-
-    if( NULL == location ) {
-        if( NULL == last_metadata_file ) {
-            opal_output(0, "Error: No metadata filename specified!");
-            exit_status = OPAL_ERROR;
-            goto cleanup;
-        } else {
-            location = last_metadata_file;
-        }
-    }
-
-    /*
-     * Find the snapshot directory, read the metadata file
-     */
-    asprintf(&dir_name, "%s/%s", location, opal_crs_base_metadata_filename);
-    if (NULL == (*meta_data = fopen(dir_name, mode)) ) {
-        exit_status = OPAL_ERROR;
-        goto cleanup;
-    }
-
- cleanup:
-    if( NULL != dir_name ) {
-        free(dir_name);
-        dir_name = NULL;
-    }
-    return exit_status;
-}
-
 static int metadata_extract_next_token(FILE *file, char **token, char **value)
 {
     int exit_status = OPAL_SUCCESS;
@@ -558,12 +458,20 @@ static int metadata_extract_next_token(FILE *file, char **token, char **value)
     *value = strdup(local_value);
  cleanup:
-    if( NULL != local_token)
+    if( NULL != local_token) {
         free(local_token);
-    if( NULL != local_value)
+        local_token = NULL;
+    }
+
+    if( NULL != local_value) {
         free(local_value);
-    if( NULL != line)
+        local_value = NULL;
+    }
+
+    if( NULL != line) {
         free(line);
+        line = NULL;
+    }
     return exit_status;
 }
diff --git a/opal/mca/crs/base/crs_base_open.c b/opal/mca/crs/base/crs_base_open.c
index 743720ca6c..b172fea33a 100644
--- a/opal/mca/crs/base/crs_base_open.c
+++ b/opal/mca/crs/base/crs_base_open.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  * All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  * All rights reserved.
@@ -48,7 +48,6 @@ opal_crs_base_module_t opal_crs = {
 };
 opal_list_t opal_crs_base_components_available;
 opal_crs_base_component_t opal_crs_base_selected_component;
-char * opal_crs_base_snapshot_dir = NULL;
 /**
  * Function for finding and opening either all MCA components,
@@ -73,14 +72,6 @@ int opal_crs_base_open(void)
     }
     opal_output_set_verbosity(opal_crs_base_output, value);
-    /* Base snapshot directory */
-    mca_base_param_reg_string_name("crs",
-                                   "base_snapshot_dir",
-                                   "The base directory to use when storing snapshots",
-                                   true, false,
-                                   strdup("/tmp"),
-                                   &opal_crs_base_snapshot_dir);
-
     /*
      * Which CRS component to open
      * - NULL or "" = auto-select
@@ -90,8 +81,14 @@ int opal_crs_base_open(void)
     mca_base_param_reg_string_name("crs", NULL,
                                    "Which CRS component to use (empty = auto-select)",
                                    false, false,
-                                   "none", &str_value);
-
+                                   NULL, &str_value);
+
+    if( !opal_cr_is_enabled ) {
+        opal_output_verbose(10, opal_crs_base_output,
+                            "crs:open: FT is not enabled, skipping!");
+        return OPAL_SUCCESS;
+    }
+
     /* Open up all available components */
     if (OPAL_SUCCESS != (ret = mca_base_components_open("crs",
                                                         opal_crs_base_output,
@@ -110,5 +107,6 @@ int opal_crs_base_open(void)
     if( NULL != str_value ) {
         free(str_value);
     }
+
     return exit_status;
 }
diff --git a/opal/mca/crs/base/crs_base_select.c b/opal/mca/crs/base/crs_base_select.c
index 1f1c42f971..d1c6b85b7f 100644
--- a/opal/mca/crs/base/crs_base_select.c
+++ b/opal/mca/crs/base/crs_base_select.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  * All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  * All rights reserved.
@@ -37,6 +37,12 @@ int opal_crs_base_select(void)
     opal_crs_base_module_t *best_module = NULL;
     int int_value = 0;
+    if( !opal_cr_is_enabled ) {
+        opal_output_verbose(10, opal_crs_base_output,
+                            "crs:select: FT is not enabled, skipping!");
+        return OPAL_SUCCESS;
+    }
+
     /*
      * Note: If we are a tool, then we will manually run the selection routine
      * for the checkpointer. The tool will set the MCA parameter
diff --git a/opal/mca/crs/blcr/configure.m4 b/opal/mca/crs/blcr/configure.m4
index cb6adee1c9..2bed18672d 100644
--- a/opal/mca/crs/blcr/configure.m4
+++ b/opal/mca/crs/blcr/configure.m4
@@ -167,6 +167,14 @@ AC_DEFUN([MCA_crs_blcr_CONFIG],[
                          [BLCRs cr_checkpoint_info.requester member availability])
           $1])
+    #
+    # Require either a working cr_request_file() or cr_request_checkpoint() function
+    #
+    AS_IF([test "$crs_blcr_have_working_cr_request" = "0" -a "$crs_blcr_have_cr_request_checkpoint" = "0"],
+          [$2
+           check_crs_blcr_good="no"
+           AC_MSG_WARN([The BLCR CRS component requires either the cr_request_checkpoint() or cr_request_file() functions])])
+
     #
     # Reset the flags
     #
diff --git a/opal/mca/crs/blcr/crs_blcr_module.c b/opal/mca/crs/blcr/crs_blcr_module.c
index d62a4f2d48..69316ac8dc 100644
--- a/opal/mca/crs/blcr/crs_blcr_module.c
+++ b/opal/mca/crs/blcr/crs_blcr_module.c
@@ -34,6 +34,7 @@
 #include "opal/mca/base/mca_base_param.h"
+#include "opal/threads/threads.h"
 #include "opal/threads/mutex.h"
 #include "opal/threads/condition.h"
@@ -94,20 +95,26 @@ OBJ_CLASS_INSTANCE(opal_crs_blcr_snapshot_t,
 /******************
  * Local Functions
  ******************/
-static int blcr_checkpoint_peer(pid_t pid, char * local_dir, char ** fname);
 static int blcr_get_checkpoint_filename(char **fname, pid_t pid);
 static int opal_crs_blcr_thread_callback(void *arg);
 static int opal_crs_blcr_signal_callback(void *arg);
-static int opal_crs_blcr_checkpoint_cmd(pid_t pid, char * local_dir, char **fname, char **cmd);
 static int opal_crs_blcr_restart_cmd(char *fname, char **cmd);
-static int blcr_update_snapshot_metadata(opal_crs_blcr_snapshot_t *snapshot);
 static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot);
+#if OPAL_ENABLE_CRDEBUG == 1
+static void MPIR_checkpoint_debugger_crs_hook(cr_hook_event_t event);
+#endif
+
 /*************************
  * Local Global Variables
  *************************/
+#if OPAL_ENABLE_CRDEBUG == 1
+static opal_thread_t *checkpoint_thread_id = NULL;
+static bool blcr_crdebug_refreshed_env = false;
+#endif
+
 static cr_client_id_t client_id;
 static cr_callback_id_t cr_thread_callback_id;
 static cr_callback_id_t cr_signal_callback_id;
@@ -127,8 +134,10 @@ void opal_crs_blcr_construct(opal_crs_blcr_snapshot_t *snapshot) {
 }
 void opal_crs_blcr_destruct( opal_crs_blcr_snapshot_t *snapshot) {
-    if(NULL != snapshot->context_filename)
+    if(NULL != snapshot->context_filename) {
         free(snapshot->context_filename);
+        snapshot->context_filename = NULL;
+    }
 }
 /*****************
@@ -167,6 +176,10 @@ int opal_crs_blcr_module_init(void)
         }
     }
+#if OPAL_ENABLE_CRDEBUG == 1
+    blcr_crdebug_refreshed_env = false;
+#endif
+
     blcr_restart_cmd = strdup("cr_restart");
     blcr_checkpoint_cmd = strdup("cr_checkpoint");
@@ -190,6 +203,20 @@ int opal_crs_blcr_module_init(void)
         cr_signal_callback_id = cr_register_callback(opal_crs_blcr_signal_callback,
                                                      crs_blcr_signal_callback_arg,
                                                      CR_SIGNAL_CONTEXT);
+
+#if OPAL_ENABLE_CRDEBUG == 1
+        /*
+         * Checkpoint/restart enabled debugging hooks
+         * "NO_CALLBACKS" -> non-MPI threads
+         * "SIGNAL_CONTEXT" -> MPI threads
+         * "THREAD_CONTEXT" -> BLCR threads
+         */
+        cr_register_hook(CR_HOOK_CONT_NO_CALLBACKS, MPIR_checkpoint_debugger_crs_hook);
+        cr_register_hook(CR_HOOK_CONT_SIGNAL_CONTEXT, MPIR_checkpoint_debugger_crs_hook);
+
+        cr_register_hook(CR_HOOK_RSTRT_NO_CALLBACKS, MPIR_checkpoint_debugger_crs_hook);
+        cr_register_hook(CR_HOOK_RSTRT_SIGNAL_CONTEXT, MPIR_checkpoint_debugger_crs_hook);
+#endif
     }
     /*
@@ -262,6 +289,17 @@ int opal_crs_blcr_module_finalize(void)
         cr_replace_callback(cr_thread_callback_id, NULL, NULL, CR_THREAD_CONTEXT);
         /* Unload the signal callback */
         cr_replace_callback(cr_signal_callback_id, NULL, NULL, CR_SIGNAL_CONTEXT);
+
+#if OPAL_ENABLE_CRDEBUG == 1
+        /*
+         * Checkpoint/restart enabled debugging hooks
+         */
+        cr_register_hook(CR_HOOK_CONT_NO_CALLBACKS, NULL);
+        cr_register_hook(CR_HOOK_CONT_SIGNAL_CONTEXT, NULL);
+
+        cr_register_hook(CR_HOOK_RSTRT_NO_CALLBACKS, NULL);
+        cr_register_hook(CR_HOOK_RSTRT_SIGNAL_CONTEXT, NULL);
+#endif
     }
     /* BLCR does not have a finalization routine */
@@ -275,175 +313,158 @@ int opal_crs_blcr_checkpoint(pid_t pid,
                              opal_crs_state_type_t *state)
 {
     int ret, exit_status = OPAL_SUCCESS;
-    opal_crs_blcr_snapshot_t *snapshot = OBJ_NEW(opal_crs_blcr_snapshot_t);
+    opal_crs_blcr_snapshot_t *snapshot = NULL;
 #if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1
     cr_checkpoint_args_t cr_args;
     static cr_checkpoint_handle_t cr_handle = (cr_checkpoint_handle_t)(-1);
 #endif
+    int fd = 0;
+    char *loc_fname = NULL;
+
+    if( pid != my_pid ) {
+        opal_output(0, "crs:blcr: checkpoint(%d, ---): Checkpointing of peers not allowed!", pid);
+        exit_status = OPAL_ERROR;
+        goto cleanup;
+    }
     opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
                         "crs:blcr: checkpoint(%d, ---)", pid);
-    if(NULL != snapshot->super.reference_name)
-        free(snapshot->super.reference_name);
-    snapshot->super.reference_name = strdup(base_snapshot->reference_name);
-
-    if(NULL != snapshot->super.local_location)
-        free(snapshot->super.local_location);
-    snapshot->super.local_location = strdup(base_snapshot->local_location);
-
-    if(NULL != snapshot->super.remote_location)
-        free(snapshot->super.remote_location);
-    snapshot->super.remote_location = strdup(base_snapshot->remote_location);
+    snapshot = (opal_crs_blcr_snapshot_t *)base_snapshot;
     /*
      * Update the snapshot metadata
      */
     snapshot->super.component_name = strdup(mca_crs_blcr_component.super.base_version.mca_component_name);
-    if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_COMP, snapshot->super.component_name) ) ) {
-        opal_output(mca_crs_blcr_component.super.output_handle,
-                    "crs:blcr: checkpoint(): Error: Unable to write component name to the directory for (%s).",
-                    snapshot->super.reference_name);
-        exit_status = ret;
-        goto cleanup;
+    blcr_get_checkpoint_filename(&(snapshot->context_filename), pid);
+
+    if( NULL == snapshot->super.metadata ) {
+        if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "a")) ) {
+            opal_output(mca_crs_blcr_component.super.output_handle,
+                        "crs:blcr: checkpoint(): Error: Unable to open the file (%s)",
+                        snapshot->super.metadata_filename);
+            exit_status = OPAL_ERROR;
+            goto cleanup;
+        }
     }
+    fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_COMP, snapshot->super.component_name);
+    fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_CONTEXT, snapshot->context_filename);
+
+    fclose(snapshot->super.metadata );
+    snapshot->super.metadata = NULL;
     /*
      * If we can checkpointing ourselves do so:
      * use cr_request_checkpoint() if available, and cr_request_file() if not
      */
-#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1 || CRS_BLCR_HAVE_CR_REQUEST == 1
-    if( pid == my_pid ) {
-        char *loc_fname = NULL;
+    if( opal_crs_blcr_dev_null ) {
+        loc_fname = strdup("/dev/null");
+    } else {
+        asprintf(&loc_fname, "%s/%s", snapshot->super.snapshot_directory, snapshot->context_filename);
+    }
-        blcr_get_checkpoint_filename(&(snapshot->context_filename), pid);
-        if( opal_crs_blcr_dev_null ) {
-            loc_fname = strdup("/dev/null");
-        } else {
-            asprintf(&loc_fname, "%s/%s", snapshot->super.local_location, snapshot->context_filename);
-        }
+#if OPAL_ENABLE_CRDEBUG == 1
+    /* Make sure to identify the checkpointing thread, so that it is not
+     * prevented from requesting the checkpoint after the debugger detaches
+     */
+    opal_cr_debug_set_current_ckpt_thread_self();
+    checkpoint_thread_id = opal_thread_get_self();
+    blcr_crdebug_refreshed_env = false;
+    /* If checkpoint/restart enabled debugging then mark detachment place */
+    if( MPIR_debug_with_checkpoint ) {
         opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
-                            "crs:blcr: checkpoint SELF <%s>",
-                            loc_fname);
+                            "crs:blcr: checkpoint(): Detaching debugger...");
+        MPIR_checkpoint_debugger_detach();
+    }
+#endif
+    opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
+                        "crs:blcr: checkpoint SELF <%s>",
+                        loc_fname);
+
+#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1 || CRS_BLCR_HAVE_CR_REQUEST == 1
 #if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1
-        {
-            int fd = 0;
-            fd = open(loc_fname,
-                      O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE,
-                      S_IRUSR | S_IWUSR);
-            if( fd < 0 ) {
+    fd = open(loc_fname,
+              O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE,
+              S_IRUSR | S_IWUSR);
+    if( fd < 0 ) {
+        *state = OPAL_CRS_ERROR;
+        opal_output(mca_crs_blcr_component.super.output_handle,
+                    "crs:blcr: checkpoint(): Error: Unable to open checkpoint file (%s) for pid (%d)",
+                    loc_fname, pid);
+        exit_status = OPAL_ERROR;
+        goto cleanup;
+    }
+
+    cr_initialize_checkpoint_args_t(&cr_args);
+    cr_args.cr_scope = CR_SCOPE_PROC;
+    cr_args.cr_fd = fd;
+    if( options->stop ) {
+        cr_args.cr_signal = SIGSTOP;
+    }
+
+    ret = cr_request_checkpoint(&cr_args, &cr_handle);
+    if( ret < 0 ) {
+        close(cr_args.cr_fd);
+        *state = OPAL_CRS_ERROR;
+        opal_output(mca_crs_blcr_component.super.output_handle,
+                    "crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d) to file (%s)",
+                    pid, loc_fname);
+        exit_status = ret;
+        goto cleanup;
+    }
+
+    /* Wait for checkpoint to finish */
+    do {
+        ret = cr_poll_checkpoint(&cr_handle, NULL);
+        if( ret < 0 ) {
+            /* Check if restarting. This is not an error. */
+            if( (ret == CR_POLL_CHKPT_ERR_POST) && (errno == CR_ERESTARTED) ) {
+                ret = 0;
+                break;
+            }
+            /* If Call was interrupted by a signal, retry the call */
+            else if (errno == EINTR) {
+                ;
+            }
+            /* Otherwise this is a real error that we need to deal with */
+            else {
                 *state = OPAL_CRS_ERROR;
                 opal_output(mca_crs_blcr_component.super.output_handle,
-                            "crs:blcr: checkpoint(): Error: Unable to open checkpoint file (%s) for pid (%d)",
-                            loc_fname, pid);
-                exit_status = OPAL_ERROR;
-                goto cleanup;
-            }
-
-            cr_initialize_checkpoint_args_t(&cr_args);
-            cr_args.cr_scope = CR_SCOPE_PROC;
-            cr_args.cr_fd = fd;
-            if( options->stop ) {
-                cr_args.cr_signal = SIGSTOP;
-            }
-
-            ret = cr_request_checkpoint(&cr_args, &cr_handle);
-            if( ret < 0 ) {
-                close(cr_args.cr_fd);
-                *state = OPAL_CRS_ERROR;
-                opal_output(mca_crs_blcr_component.super.output_handle,
-                            "crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d) to file (%s)",
-                            pid, loc_fname);
+                            "crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d) to file (%s) - poll failed with (%d)",
+                            pid, loc_fname, ret);
                 exit_status = ret;
                 goto cleanup;
             }
-
-            /* Wait for checkpoint to finish */
-            do {
-                ret = cr_poll_checkpoint(&cr_handle, NULL);
-                if( ret < 0 ) {
-                    /* Check if restarting. This is not an error. */
-                    if( (ret == CR_POLL_CHKPT_ERR_POST) && (errno == CR_ERESTARTED) ) {
-                        ret = 0;
-                        break;
-                    }
-                    /* If Call was interrupted by a signal, retry the call */
-                    else if (errno == EINTR) {
-                        ;
-                    }
-                    /* Otherwise this is a real error that we need to deal with */
-                    else {
-                        *state = OPAL_CRS_ERROR;
-                        opal_output(mca_crs_blcr_component.super.output_handle,
-                                    "crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d) to file (%s) - poll failed with (%d)",
-                                    pid, loc_fname, ret);
-                        exit_status = ret;
-                        goto cleanup;
-                    }
-                }
-            } while( ret < 0 );
-
-            /* Close the file */
-            close(cr_args.cr_fd);
         }
+    } while( ret < 0 );
+
+    /* Close the file */
+    close(cr_args.cr_fd);
 #else
-        /* Request a checkpoint be taken of the current process.
-         * Since we are not guaranteed to finish the checkpoint before this
-         * returns, we also need to wait for it.
-         */
-        cr_request_file(loc_fname);
+    /* Request a checkpoint be taken of the current process.
+     * Since we are not guaranteed to finish the checkpoint before this
+     * returns, we also need to wait for it.
+     */
+    cr_request_file(loc_fname);
-        /* Wait for checkpoint to finish */
-        do {
-            usleep(1000); /* JJH Do we really want to sleep? */
-        } while(CR_STATE_IDLE != cr_status());
+    /* Wait for checkpoint to finish */
+    do {
+        usleep(1000); /* JJH Do we really want to sleep? */
+    } while(CR_STATE_IDLE != cr_status());
+#endif
 #endif
-        *state = blcr_current_state;
-        free(loc_fname);
-    }
-    /*
-     * Checkpointing another process
-     */
-    else
-#endif
-    {
-        ret = blcr_checkpoint_peer(pid, snapshot->super.local_location, &(snapshot->context_filename));
-
-        if(OPAL_SUCCESS != ret) {
-            *state = OPAL_CRS_ERROR;
-            opal_output(mca_crs_blcr_component.super.output_handle,
-                        "crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d)",
-                        pid);
-            exit_status = ret;
-            goto cleanup;
-        }
-
-        *state = blcr_current_state;
-    }
+    *state = blcr_current_state;
+    free(loc_fname);
-    if(*state == OPAL_CRS_CONTINUE) {
-        /*
-         * Update the metadata file
-         */
-        if( OPAL_SUCCESS != (ret = blcr_update_snapshot_metadata(snapshot)) ) {
-            *state = OPAL_CRS_ERROR;
-            opal_output(mca_crs_blcr_component.super.output_handle,
-                        "crs:blcr: checkpoint(): Error: Unable to update metadata for snapshot (%s).",
-                        snapshot->super.reference_name);
-            exit_status = ret;
-            goto cleanup;
-        }
+ cleanup:
+    if( NULL != snapshot->super.metadata ) {
+        fclose(snapshot->super.metadata );
+        snapshot->super.metadata = NULL;
     }
-    /*
-     * Return to the caller
-     */
-    base_snapshot = &(snapshot->super);
-
- cleanup:
     return exit_status;
 }
@@ -459,7 +480,7 @@ int opal_crs_blcr_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
     snapshot->super = *base_snapshot;
     opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
-                        "crs:blcr: restart(%s, %d)", snapshot->super.reference_name, spawn_child);
+                        "crs:blcr: restart(--, %d)", spawn_child);
     /*
      * If we need to reconstruct the snapshot,
@@ -486,10 +507,6 @@ int opal_crs_blcr_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
         goto cleanup;
     }
-
-    /*
-     * Restart by replacing this process
-     */
     /* Need to shutdown the event engine before this.
      * for some reason the BLCR checkpointer and our event engine don't get
      * along very well.
@@ -586,94 +603,6 @@ int opal_crs_blcr_enable_checkpoint(void)
 /*****************************
  * Local Function Definitions
  *****************************/
-static int blcr_checkpoint_peer(pid_t pid, char * local_dir, char ** fname)
-{
-    char **cr_argv = NULL;
-    char *cr_cmd = NULL;
-    int ret;
-    pid_t child_pid;
-    int exit_status = OPAL_SUCCESS;
-    int status, child_status;
-
-    opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
-                        "crs:blcr: checkpoint_peer(%d, --)", pid);
-
-    /*
-     * Get the checkpoint command
-     */
-    if ( OPAL_SUCCESS != (ret = opal_crs_blcr_checkpoint_cmd(pid, local_dir, fname, &cr_cmd)) ) {
-        opal_output(mca_crs_blcr_component.super.output_handle,
-                    "crs:blcr: checkpoint_peer: Failed to generate checkpoint command :(%d):", ret);
-        exit_status = ret;
-        goto cleanup;
-    }
-    if ( NULL == (cr_argv = opal_argv_split(cr_cmd, ' ')) ) {
-        opal_output(mca_crs_blcr_component.super.output_handle,
-                    "crs:blcr: checkpoint_peer: Failed to opal_argv_split :(%d):", ret);
-        exit_status = OPAL_ERROR;
-        goto cleanup;
-    }
-
-    /*
-     * Fork a child to do the checkpoint
-     */
-    blcr_current_state = OPAL_CRS_CHECKPOINT;
-
-    child_pid = fork();
-
-    if(0 == child_pid) {
-        /* Child Process */
-        opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
-                            "crs:blcr: blcr_checkpoint_peer: exec :(%s, %s):",
-                            strdup(blcr_checkpoint_cmd),
-                            opal_argv_join(cr_argv, ' '));
-
-        status = execvp(strdup(blcr_checkpoint_cmd), cr_argv);
-
-        if(status < 0) {
-            opal_output(mca_crs_blcr_component.super.output_handle,
-                        "crs:blcr: blcr_checkpoint_peer: Child failed to execute :(%d):", status);
-        }
-        opal_output(mca_crs_blcr_component.super.output_handle,
-                    "crs:blcr: blcr_checkpoint_peer: execvp returned %d", status);
-    }
-    else if(child_pid > 0) {
-        /* Don't waitpid here since we don't really want to restart from inside waitpid ;) */
-        while(OPAL_CRS_RESTART != blcr_current_state &&
-              OPAL_CRS_CONTINUE != blcr_current_state ) {
-            OPAL_THREAD_LOCK(&blcr_lock);
-            opal_condition_wait(&blcr_cond, &blcr_lock);
-            OPAL_THREAD_UNLOCK(&blcr_lock);
-        }
-
-        opal_output(mca_crs_blcr_component.super.output_handle,
-                    "crs:blcr: blcr_checkpoint_peer: Thread finished with status %d", blcr_current_state);
-
-        if(OPAL_CRS_CONTINUE == blcr_current_state) {
-            /* Wait for the child only if we are continuing */
-            if( 0 > waitpid(child_pid, &child_status, 0) ) {
-                opal_output(mca_crs_blcr_component.super.output_handle,
-                            "crs:blcr: blcr_checkpoint_peer: waitpid returned %d", child_status);
-            }
-        }
-    }
-    else {
-        opal_output(mca_crs_blcr_component.super.output_handle,
-                    "crs:blcr: blcr_checkpoint_peer: fork failed :(%d):", child_pid);
-    }
-
-    /*
-     * Cleanup
-     */
-cleanup:
-    if(NULL != cr_cmd)
-        free(cr_cmd);
-    if(NULL != cr_argv)
-        opal_argv_free(cr_argv);
-
-    return exit_status;
-}
-
 static int opal_crs_blcr_thread_callback(void *arg) {
     const struct cr_checkpoint_info *ckpt_info = cr_get_checkpoint_info();
     int ret;
@@ -700,6 +629,11 @@ static int opal_crs_blcr_thread_callback(void *arg) {
     else
 #endif
     {
+        if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_CRS_PRE_CKPT,
+                                                            OMPI_CR_INC_STATE_PREPARE)) ) {
+            ;
+        }
+
         ret = cr_checkpoint(0);
     }
@@ -720,6 +654,13 @@ static int opal_crs_blcr_thread_callback(void *arg) {
         blcr_current_state = OPAL_CRS_CONTINUE;
     }
+    if( OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_CRS_POST_CKPT,
+                                                         (blcr_current_state == OPAL_CRS_CONTINUE ?
+                                                          OMPI_CR_INC_STATE_CONTINUE :
+                                                          OMPI_CR_INC_STATE_RESTART))) ) {
+        ;
+    }
+
     OPAL_THREAD_UNLOCK(&blcr_lock);
     opal_condition_signal(&blcr_cond);
@@ -747,66 +688,6 @@ static int opal_crs_blcr_signal_callback(void *arg) {
     return 0;
 }
-static int opal_crs_blcr_checkpoint_cmd(pid_t pid, char * local_dir, char **fname, char **cmd)
-{
-    char **cr_argv = NULL;
-    int argc = 0, ret;
-    char * pid_str;
-    int exit_status = OPAL_SUCCESS;
-    char * loc_fname = NULL;
-
-    blcr_get_checkpoint_filename(fname, pid);
-
-    opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
-                        "crs:blcr: checkpoint_cmd(%d)", pid);
-
-    asprintf(&loc_fname, "%s/%s", local_dir, *fname);
-
-    /*
-     * Build the command
-     */
-    if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup(blcr_checkpoint_cmd)))) {
-        exit_status = ret;
-        goto cleanup;
-    }
-
-    if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup("--pid")))) {
-        exit_status = ret;
-        goto cleanup;
-    }
-
-    asprintf(&pid_str, "%d", pid);
-    if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup(pid_str)))) {
-        exit_status = ret;
-        goto cleanup;
-    }
-
-    if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup("--file")))) {
-        exit_status = ret;
-        goto cleanup;
-    }
-
-    if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup(loc_fname)))) {
-        exit_status = ret;
-        goto cleanup;
-    }
-
- cleanup:
-    if(exit_status != OPAL_SUCCESS)
-        *cmd = NULL;
-    else
-        *cmd = opal_argv_join(cr_argv, ' ');
-
-    if(NULL != pid_str)
-        free(pid_str);
-    if( NULL != cr_argv)
-        opal_argv_free(cr_argv);
-    if(NULL != loc_fname)
-        free(loc_fname);
-
-    return exit_status;
-}
-
 static int opal_crs_blcr_restart_cmd(char *fname, char **cmd)
 {
     opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
@@ -833,32 +714,6 @@ static int blcr_get_checkpoint_filename(char **fname, pid_t pid)
     return OPAL_SUCCESS;
 }
-static int blcr_update_snapshot_metadata(opal_crs_blcr_snapshot_t *snapshot) {
-    int exit_status = OPAL_SUCCESS;
-
-    opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
-                        "crs:blcr: update_snapshot_metadata(%s)", snapshot->super.reference_name);
-
-    /* Bozo check to make sure this snapshot is ours */
-    if ( 0 != strncmp(mca_crs_blcr_component.super.base_version.mca_component_name,
-                      snapshot->super.component_name,
-                      strlen(snapshot->super.component_name)) ) {
-        exit_status = OPAL_ERROR;
-        opal_output(mca_crs_blcr_component.super.output_handle,
-                    "crs:blcr: blcr_update_snapshot_metadata: Error: This snapshot (%s) is not intended for us (%s)\n",
-                    snapshot->super.component_name, mca_crs_blcr_component.super.base_version.mca_component_name);
-        goto cleanup;
-    }
-
-    /*
-     * Append to the metadata file the context filename
-     */
-    opal_crs_base_metadata_write_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, snapshot->context_filename);
-
- cleanup:
-    return exit_status;
-}
-
 static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
     int ret, exit_status = OPAL_SUCCESS;
     char **tmp_argv = NULL;
@@ -866,16 +721,25 @@ static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
     int prev_pid;
     opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
-                        "crs:blcr: cold_start(%s)", snapshot->super.reference_name);
+                        "crs:blcr: cold_start()");
     /*
      * Find the snapshot directory, read the metadata file
      */
-    if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.local_location,
+    if( NULL == snapshot->super.metadata ) {
+        if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "r")) ) {
+            opal_output(mca_crs_blcr_component.super.output_handle,
+                        "crs:blcr: checkpoint(): Error: Unable to open the file (%s)",
+                        snapshot->super.metadata_filename);
+            exit_status = OPAL_ERROR;
+            goto cleanup;
+        }
+    }
+    if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.metadata,
                                                                         &component_name, &prev_pid) ) ) {
         opal_output(mca_crs_blcr_component.super.output_handle,
                     "crs:blcr: blcr_cold_start: Error: Failed to extract the metadata from the local snapshot (%s). Returned %d.",
-                    snapshot->super.local_location, ret);
+                    snapshot->super.metadata_filename, ret);
         exit_status = ret;
         goto cleanup;
     }
@@ -895,15 +759,15 @@ static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
     /*
      * Context Filename
      */
-    opal_crs_base_metadata_read_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, &tmp_argv);
+    opal_crs_base_metadata_read_token(snapshot->super.metadata, CRS_METADATA_CONTEXT, &tmp_argv);
     if( NULL == tmp_argv ) {
         opal_output(mca_crs_blcr_component.super.output_handle,
                     "crs:blcr: blcr_cold_start: Error: Failed to read the %s token from the local checkpoint in %s",
-                    CRS_METADATA_CONTEXT, snapshot->super.local_location);
+                    CRS_METADATA_CONTEXT, snapshot->super.snapshot_directory);
         exit_status = OPAL_ERROR;
         goto cleanup;
     }
-    asprintf(&snapshot->context_filename, "%s/%s", snapshot->super.local_location, tmp_argv[0]);
+    asprintf(&snapshot->context_filename, "%s/%s", snapshot->super.snapshot_directory, tmp_argv[0]);
     /*
      * Reset the cold_start flag
@@ -916,5 +780,75 @@ static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
         tmp_argv = NULL;
     }
+    if( NULL != snapshot->super.metadata ) {
+        fclose(snapshot->super.metadata);
+        snapshot->super.metadata = NULL;
+    }
+
     return exit_status;
 }
+
+#if OPAL_ENABLE_CRDEBUG == 1
+static void MPIR_checkpoint_debugger_crs_hook(cr_hook_event_t event) {
+    opal_thread_t *my_thread_id = NULL;
+    my_thread_id = opal_thread_get_self();
+
+    /* Non-MPI threads */
+    if(event == CR_HOOK_RSTRT_NO_CALLBACKS ) {
+        /* wait for the MPI thread to refresh the environment for us */
+        while(!blcr_crdebug_refreshed_env) {
+            sched_yield();
+        }
+    }
+    /* MPI threads */
+    else if(event == CR_HOOK_RSTRT_SIGNAL_CONTEXT ) {
+        if( opal_thread_self_compare(checkpoint_thread_id) ) {
+            opal_cr_refresh_environ(my_pid);
+            blcr_crdebug_refreshed_env = true;
+        } else {
+            while(!blcr_crdebug_refreshed_env) {
sched_yield(); + } + } + } + + /* + * Some debugging output + */ + /* Non-MPI threads */ + if( event == CR_HOOK_CONT_NO_CALLBACKS ) { + opal_output_verbose(10, mca_crs_blcr_component.super.output_handle, + "crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Continue (Non-MPI). (%d)", + (int)my_thread_id->t_handle); + } + else if(event == CR_HOOK_RSTRT_NO_CALLBACKS ) { + opal_output_verbose(10, mca_crs_blcr_component.super.output_handle, + "crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Restart (Non-MPI). (%d)", + (int)my_thread_id->t_handle); + } + /* MPI Threads */ + else if( event == CR_HOOK_CONT_SIGNAL_CONTEXT ) { + opal_output_verbose(10, mca_crs_blcr_component.super.output_handle, + "crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Continue (MPI)."); + } + else if(event == CR_HOOK_RSTRT_SIGNAL_CONTEXT ) { + opal_output_verbose(10, mca_crs_blcr_component.super.output_handle, + "crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Restart (MPI)."); + } + + /* + * Enter the breakpoint function. + * If no debugger intends on attaching, then this function is expected to + * return immediately. + * + * If this is an MPI thread then odds are that this is the checkpointing + * thread, in which case this function will return immediately allowing + * it to prepare the MPI library before signaling to the debugger that + * it is safe to attach, if necessary. + */ + MPIR_checkpoint_debugger_waitpoint(); + + opal_output_verbose(10, mca_crs_blcr_component.super.output_handle, + "crs:blcr: MPIR_checkpoint_debugger_crs_hook: Finished..."); + } +#endif diff --git a/opal/mca/crs/crs.h b/opal/mca/crs/crs.h index 96820e5dba..87ca99f92d 100644 --- a/opal/mca/crs/crs.h +++ b/opal/mca/crs/crs.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. 
* Copyright (c) 2004-2005 The University of Tennessee and The University @@ -79,6 +79,19 @@ struct opal_crs_base_ckpt_options_1_0_0_t { bool term; /** Send SIGSTOP after checkpoint */ bool stop; + + /** INC Prep Only */ + bool inc_prep_only; + + /** INC Recover Only */ + bool inc_recover_only; + +#if OPAL_ENABLE_CRDEBUG == 1 + /** Wait for debugger to attach after checkpoint */ + bool attach_debugger; + /** Do not wait for debugger to reattach after checkpoint */ + bool detach_debugger; +#endif }; typedef struct opal_crs_base_ckpt_options_1_0_0_t opal_crs_base_ckpt_options_1_0_0_t; typedef struct opal_crs_base_ckpt_options_1_0_0_t opal_crs_base_ckpt_options_t; @@ -96,12 +109,14 @@ struct opal_crs_base_snapshot_1_0_0_t { /** MCA Component name */ char * component_name; - /** Unique name of snapshot */ - char * reference_name; + /** Metadata filename */ + char * metadata_filename; + + /** Metadata fd */ + FILE * metadata; /** Absolute path the the snapshot directory */ - char * local_location; - char * remote_location; + char * snapshot_directory; /** Cold Start: * If we are restarting cold, then we need to recreate this structure diff --git a/opal/mca/crs/none/crs_none_module.c b/opal/mca/crs/none/crs_none_module.c index 2ee8cd93bf..4c748a4c5f 100644 --- a/opal/mca/crs/none/crs_none_module.c +++ b/opal/mca/crs/none/crs_none_module.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. 
* * $COPYRIGHT$ @@ -58,25 +58,25 @@ int opal_crs_none_checkpoint(pid_t pid, opal_crs_base_ckpt_options_t *options, opal_crs_state_type_t *state) { - int ret; - *state = OPAL_CRS_CONTINUE; snapshot->component_name = strdup("none"); - snapshot->reference_name = strdup("none"); - snapshot->local_location = strdup(""); - snapshot->remote_location = strdup(""); snapshot->cold_start = false; /* * Update the snapshot metadata */ - if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_COMP, "none") ) ) { - opal_output(0, - "crs:none: checkpoint(): Error: Unable to write component name to the directory for (%s).", - snapshot->reference_name); - return ret; + if( NULL == snapshot->metadata ) { + if (NULL == (snapshot->metadata = fopen(snapshot->metadata_filename, "a")) ) { + opal_output(0, + "crs:none: checkpoint(): Error: Unable to open the file (%s)", + snapshot->metadata_filename); + return OPAL_ERROR; + } } + fprintf(snapshot->metadata, "%s%s\n", CRS_METADATA_COMP, snapshot->component_name); + fclose(snapshot->metadata); + snapshot->metadata = NULL; if( options->stop ) { opal_output(0, @@ -88,28 +88,43 @@ int opal_crs_none_checkpoint(pid_t pid, int opal_crs_none_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_child, pid_t *child_pid) { + int exit_status = OPAL_SUCCESS; char **tmp_argv = NULL; char **cr_argv = NULL; int status; *child_pid = getpid(); - opal_crs_base_metadata_read_token(base_snapshot->local_location, CRS_METADATA_CONTEXT, &tmp_argv); + if( NULL == base_snapshot->metadata ) { + if (NULL == (base_snapshot->metadata = fopen(base_snapshot->metadata_filename, "r")) ) { + opal_output(0, + "crs:none: restart(): Error: Unable to open the file (%s)", + base_snapshot->metadata_filename); + exit_status = OPAL_ERROR; + goto cleanup; + } + } + + opal_crs_base_metadata_read_token(base_snapshot->metadata, CRS_METADATA_CONTEXT, &tmp_argv); + if( NULL == tmp_argv ) { opal_output(opal_crs_base_output, "crs:none: none_restart: Error: 
Failed to read the %s token from the local checkpoint in %s", - CRS_METADATA_CONTEXT, base_snapshot->local_location); - return OPAL_ERROR; + CRS_METADATA_CONTEXT, base_snapshot->metadata_filename); + exit_status = OPAL_ERROR; + goto cleanup; } if( opal_argv_count(tmp_argv) <= 0 ) { opal_output_verbose(10, opal_crs_base_output, "crs:none: none_restart: No command line to exec, so just returning"); - return OPAL_SUCCESS; + exit_status = OPAL_SUCCESS; + goto cleanup; } if ( NULL == (cr_argv = opal_argv_split(tmp_argv[0], ' ')) ) { - return OPAL_ERROR; + exit_status = OPAL_ERROR; + goto cleanup; } if( !spawn_child ) { @@ -126,14 +141,20 @@ int opal_crs_none_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch } opal_output(opal_crs_base_output, "crs:none: none_restart: execvp returned %d", status); - return status; + exit_status = status; + goto cleanup; } else { opal_output(opal_crs_base_output, "crs:none: none_restart: Spawn not implemented"); - return OPAL_ERR_NOT_IMPLEMENTED; + exit_status = OPAL_ERR_NOT_IMPLEMENTED; + goto cleanup; } - return OPAL_SUCCESS; + cleanup: + fclose(base_snapshot->metadata); + base_snapshot->metadata = NULL; + + return exit_status; } int opal_crs_none_disable_checkpoint(void) diff --git a/opal/mca/crs/opal_crs.7in b/opal/mca/crs/opal_crs.7in index a9c0f59ac2..6719b3c1c3 100644 --- a/opal/mca/crs/opal_crs.7in +++ b/opal/mca/crs/opal_crs.7in @@ -1,5 +1,5 @@ .\" -.\" Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana +.\" Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana .\" University Research and Technology .\" Corporation. All rights reserved. .\" Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved. @@ -89,10 +89,6 @@ The following MCA parameters apply to all components: crs_base_verbose Set the verbosity level for all components. Default is 0, or silent except on error. . -.TP -crs_base_snapshot_dir -The directory to store the checkpoint snapshots. Default is \fB/tmp\fP. -. 
.\" Self Component .\" ****************** .SS self CRS Component diff --git a/opal/mca/crs/self/crs_self_module.c b/opal/mca/crs/self/crs_self_module.c index 2838724724..60336f3076 100644 --- a/opal/mca/crs/self/crs_self_module.c +++ b/opal/mca/crs/self/crs_self_module.c @@ -285,17 +285,11 @@ int opal_crs_self_checkpoint(pid_t pid, /* * Setup for snapshot directory creation */ - if(NULL != snapshot->super.reference_name) - free(snapshot->super.reference_name); - snapshot->super.reference_name = strdup(base_snapshot->reference_name); - - if(NULL != snapshot->super.local_location) - free(snapshot->super.local_location); - snapshot->super.local_location = strdup(base_snapshot->local_location); - - if(NULL != snapshot->super.remote_location) - free(snapshot->super.remote_location); - snapshot->super.remote_location = strdup(base_snapshot->remote_location); + snapshot->super = *base_snapshot; +#if 0 + snapshot->super.snapshot_directory = strdup(base_snapshot->snapshot_directory); + snapshot->super.metadata_filename = strdup(base_snapshot->metadata_filename); +#endif opal_output_verbose(10, mca_crs_self_component.super.output_handle, "crs:self: checkpoint(%d, ---)", pid); @@ -310,13 +304,16 @@ int opal_crs_self_checkpoint(pid_t pid, * Update the snapshot metadata */ snapshot->super.component_name = strdup(mca_crs_self_component.super.base_version.mca_component_name); - if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_COMP, snapshot->super.component_name) ) ) { - opal_output(mca_crs_self_component.super.output_handle, - "crs:self: checkpoint(): Error: Unable to write component name to the directory for (%s).", - snapshot->super.reference_name); - exit_status = ret; - goto cleanup; + if( NULL == snapshot->super.metadata ) { + if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "a")) ) { + opal_output(mca_crs_self_component.super.output_handle, + "crs:self: checkpoint(): Error: Unable to open the file (%s)", + 
snapshot->super.metadata_filename); + exit_status = OPAL_ERROR; + goto cleanup; + } } + fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_COMP, snapshot->super.component_name); /* * Call the user callback function @@ -350,7 +347,7 @@ int opal_crs_self_checkpoint(pid_t pid, *state = OPAL_CRS_ERROR; opal_output(mca_crs_self_component.super.output_handle, "crs:self: checkpoint(): Error: Unable to update metadata for snapshot (%s).", - snapshot->super.reference_name); + snapshot->super.metadata_filename); exit_status = ret; goto cleanup; } @@ -392,7 +389,7 @@ int opal_crs_self_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch snapshot->super = *base_snapshot; opal_output_verbose(10, mca_crs_self_component.super.output_handle, - "crs:self: restart(%s, %d)", snapshot->super.reference_name, spawn_child); + "crs:self: restart(%d)", spawn_child); /* * If we need to reconstruct the snapshot @@ -675,16 +672,25 @@ static int self_cold_start(opal_crs_self_snapshot_t *snapshot) { int prev_pid; opal_output_verbose(10, mca_crs_self_component.super.output_handle, - "crs:self: cold_start(%s)", snapshot->super.reference_name); + "crs:self: cold_start()"); /* * Find the snapshot directory, read the metadata file */ - if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.local_location, + if( NULL == snapshot->super.metadata ) { + if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "r")) ) { + opal_output(mca_crs_self_component.super.output_handle, + "crs:self: cold_start(): Error: Unable to open the file (%s)", + snapshot->super.metadata_filename); + exit_status = OPAL_ERROR; + goto cleanup; + } + } + if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.metadata, &component_name, &prev_pid) ) ) { opal_output(mca_crs_self_component.super.output_handle, "crs:self: self_cold_start: Error: Failed to extract the metadata from the local snapshot (%s). 
Returned %d.", - snapshot->super.local_location, ret); + snapshot->super.metadata_filename, ret); exit_status = ret; goto cleanup; } @@ -705,11 +711,11 @@ static int self_cold_start(opal_crs_self_snapshot_t *snapshot) { * Restart command * JJH: Command lines limited to 256 chars. */ - opal_crs_base_metadata_read_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, &tmp_argv); + opal_crs_base_metadata_read_token(snapshot->super.metadata, CRS_METADATA_CONTEXT, &tmp_argv); if( NULL == tmp_argv ) { opal_output(mca_crs_self_component.super.output_handle, "crs:self: self_cold_start: Error: Failed to read the %s token from the local checkpoint in %s", - CRS_METADATA_CONTEXT, snapshot->super.local_location); + CRS_METADATA_CONTEXT, snapshot->super.snapshot_directory); exit_status = OPAL_ERROR; goto cleanup; } @@ -742,13 +748,13 @@ static int self_update_snapshot_metadata(opal_crs_self_snapshot_t *snapshot) { opal_output_verbose(10, mca_crs_self_component.super.output_handle, "crs:self: update_snapshot_metadata(%s)", - snapshot->super.reference_name); + snapshot->super.metadata_filename); /* * Append to the metadata file the command line to restart with * - How user wants us to restart */ - opal_crs_base_metadata_write_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, snapshot->cmd_line); + fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_CONTEXT, snapshot->cmd_line); cleanup: return exit_status; diff --git a/opal/runtime/opal_cr.c b/opal/runtime/opal_cr.c index 82618af313..5c2dd3111e 100644 --- a/opal/runtime/opal_cr.c +++ b/opal/runtime/opal_cr.c @@ -74,9 +74,21 @@ /****************** * Global Var Decls ******************/ +#if OPAL_ENABLE_CRDEBUG == 1 +static opal_thread_t **opal_cr_debug_free_threads = NULL; +static int opal_cr_debug_num_free_threads = 0; +static int opal_cr_debug_threads_already_waiting = false; + +int MPIR_debug_with_checkpoint = 0; +static volatile int MPIR_checkpoint_debug_gate = 0; + +int opal_cr_debug_signal = 0; +#endif + 
bool opal_cr_stall_check = false; bool opal_cr_currently_stalled = false; int opal_cr_output; +int opal_cr_initalized = 0; static double opal_cr_get_time(void); static void display_indv_timer_core(double diff, char *str); @@ -89,10 +101,11 @@ int opal_cr_timing_target_rank = 0; /****************** * Local Functions & Var Decls ******************/ -static int extract_env_vars(int prev_pid); +static int extract_env_vars(int prev_pid, char * file_name); static void opal_cr_sigpipe_debug_signal_handler (int signo); +static opal_cr_user_inc_callback_fn_t cur_user_coord_callback[OMPI_CR_INC_MAX] = {NULL}; static opal_cr_coord_callback_fn_t cur_coord_callback = NULL; static opal_cr_notify_callback_fn_t cur_notify_callback = NULL; @@ -179,13 +192,11 @@ int opal_cr_set_enabled(bool en) return OPAL_SUCCESS; } -int opal_cr_initalized = 0; - int opal_cr_init(void ) { int ret, exit_status = OPAL_SUCCESS; opal_cr_coord_callback_fn_t prev_coord_func; - int val; + int val, t; if( ++opal_cr_initalized != 1 ) { if( opal_cr_initalized < 1 ) { @@ -265,9 +276,9 @@ int opal_cr_init(void ) opal_cr_thread_sleep_check = val; mca_base_param_reg_int_name("opal_cr", "thread_sleep_wait", - "Time to sleep waiting for process to exit MPI library (Default: 0)", + "Time to sleep waiting for process to exit MPI library (Default: 1000)", false, false, - 0, &val); + 1000, &val); opal_cr_thread_sleep_wait = val; opal_output_verbose(10, opal_cr_output, @@ -285,6 +296,19 @@ int opal_cr_init(void ) opal_output_verbose(10, opal_cr_output, "opal_cr: init: Is a tool program: %d", val); +#if OPAL_ENABLE_CRDEBUG == 1 + mca_base_param_reg_int_name("opal_cr", "enable_crdebug", + "Enable checkpoint/restart debugging", + false, false, + 0, + &val); + MPIR_debug_with_checkpoint = OPAL_INT_TO_BOOL(val); + + opal_output_verbose(10, opal_cr_output, + "opal_cr: init: C/R Debugging Enabled [%s]\n", + (MPIR_debug_with_checkpoint ? 
"True": "False")); +#endif + #ifndef __WINDOWS__ mca_base_param_reg_int_name("opal_cr", "signal", "Checkpoint/Restart signal used to initialize an OPAL Only checkpoint of a program", @@ -327,10 +351,36 @@ int opal_cr_init(void ) opal_cr_is_tool = true; /* no support for CR on Windows yet */ #endif /* __WINDOWS__ */ +#if OPAL_ENABLE_CRDEBUG == 1 + opal_cr_debug_num_free_threads = 3; + opal_cr_debug_free_threads = (opal_thread_t **)malloc(sizeof(opal_thread_t *) * opal_cr_debug_num_free_threads ); + for(t = 0; t < opal_cr_debug_num_free_threads; ++t ) { + opal_cr_debug_free_threads[t] = NULL; + } + + mca_base_param_reg_int_name("opal_cr", "crdebug_signal", + "Checkpoint/Restart signal used to hold threads when debugging", + false, false, + SIGTSTP, + &opal_cr_debug_signal); + + opal_output_verbose(10, opal_cr_output, + "opal_cr: init: Checkpoint Signal (Debug): %d", + opal_cr_debug_signal); + if( SIG_ERR == signal(opal_cr_debug_signal, MPIR_checkpoint_debugger_signal_handler) ) { + opal_output(opal_cr_output, + "opal_cr: init: Failed to register C/R debug signal (%d)", + opal_cr_debug_signal); + } +#else + /* Silence a compiler warning */ + t = 0; +#endif + mca_base_param_reg_string_name("opal_cr", "tmp_dir", "Temporary directory to place rendezvous files for a checkpoint", false, false, - "/tmp", + opal_tmp_directory(), &opal_cr_pipe_dir); opal_output_verbose(10, opal_cr_output, @@ -436,6 +486,14 @@ int opal_cr_finalize(void) opal_cr_checkpoint_request = OPAL_CR_STATUS_TERM; } +#if OPAL_ENABLE_CRDEBUG == 1 + if( NULL != opal_cr_debug_free_threads ) { + free( opal_cr_debug_free_threads ); + opal_cr_debug_free_threads = NULL; + } + opal_cr_debug_num_free_threads = 0; +#endif + if (NULL != opal_cr_pipe_dir) { free(opal_cr_pipe_dir); opal_cr_pipe_dir = NULL; @@ -523,6 +581,14 @@ int opal_cr_inc_core_prep(void) { int ret; + /* + * Call User Level INC + */ + if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_PRE_CRS_PRE_MPI, + OMPI_CR_INC_STATE_PREPARE)) ) { 
+ return ret; + } + /* * Use the registered coordination routine */ @@ -535,6 +601,14 @@ int opal_cr_inc_core_prep(void) return ret; } + /* + * Call User Level INC + */ + if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_PRE_CRS_POST_MPI, + OMPI_CR_INC_STATE_PREPARE)) ) { + return ret; + } + core_prev_pid = getpid(); return OPAL_SUCCESS; @@ -575,7 +649,7 @@ int opal_cr_inc_core_ckpt(pid_t pid, * If restarting read environment stuff that opal-restart left us. */ if(*state == OPAL_CRS_RESTART) { - extract_env_vars(core_prev_pid); + opal_cr_refresh_environ(core_prev_pid); opal_cr_checkpointing_state = OPAL_CR_STATUS_RESTART_PRE; } @@ -585,6 +659,7 @@ int opal_cr_inc_core_ckpt(pid_t pid, int opal_cr_inc_core_recover(int state) { int ret; + opal_cr_user_inc_callback_state_t cb_state; if( opal_cr_checkpointing_state != OPAL_CR_STATUS_TERM && opal_cr_checkpointing_state != OPAL_CR_STATUS_CONTINUE && @@ -599,11 +674,29 @@ int opal_cr_inc_core_recover(int state) * If restarting read environment stuff that opal-restart left us. 
*/ else if(state == OPAL_CRS_RESTART) { - extract_env_vars(core_prev_pid); + opal_cr_refresh_environ(core_prev_pid); opal_cr_checkpointing_state = OPAL_CR_STATUS_RESTART_PRE; } } + /* + * Call User Level INC + */ + if( OPAL_CRS_CONTINUE == state ) { + cb_state = OMPI_CR_INC_STATE_CONTINUE; + } + else if( OPAL_CRS_RESTART == state ) { + cb_state = OMPI_CR_INC_STATE_RESTART; + } + else { + cb_state = OMPI_CR_INC_STATE_ERROR; + } + + if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_POST_CRS_PRE_MPI, + cb_state)) ) { + return ret; + } + /* * Use the registered coordination routine */ @@ -616,6 +709,15 @@ int opal_cr_inc_core_recover(int state) return ret; } + if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_POST_CRS_POST_MPI, + cb_state)) ) { + return ret; + } + +#if OPAL_ENABLE_CRDEBUG == 1 + opal_cr_debug_clear_current_ckpt_thread(); +#endif + return OPAL_SUCCESS; } @@ -717,6 +819,39 @@ int opal_cr_reg_notify_callback(opal_cr_notify_callback_fn_t new_func, return OPAL_SUCCESS; } +int opal_cr_user_inc_register_callback(opal_cr_user_inc_callback_event_t event, + opal_cr_user_inc_callback_fn_t function, + opal_cr_user_inc_callback_fn_t *prev_function) +{ + if( event < 0 || event >= OMPI_CR_INC_MAX ) { + return OPAL_ERROR; + } + + if( NULL != cur_user_coord_callback[event] ) { + *prev_function = cur_user_coord_callback[event]; + } else { + *prev_function = NULL; + } + + cur_user_coord_callback[event] = function; + + return OPAL_SUCCESS; +} + +int trigger_user_inc_callback(opal_cr_user_inc_callback_event_t event, + opal_cr_user_inc_callback_state_t state) +{ + if( event < 0 || event >= OMPI_CR_INC_MAX ) { + return OPAL_ERROR; + } + + if( NULL == cur_user_coord_callback[event] ) { + return OPAL_SUCCESS; + } + + return ((cur_user_coord_callback[event])(event, state)); +} + int opal_cr_reg_coord_callback(opal_cr_coord_callback_fn_t new_func, opal_cr_coord_callback_fn_t *prev_func) { @@ -738,14 +873,61 @@ int
opal_cr_reg_coord_callback(opal_cr_coord_callback_fn_t new_func, return OPAL_SUCCESS; } +int opal_cr_refresh_environ(int prev_pid) { + int val; + char *file_name = NULL; + struct stat file_status; + + if( 0 >= prev_pid ) { + prev_pid = getpid(); + } + + /* + * Make sure the file exists. If it doesn't then this means 2 things: + * 1) We have already executed this function, and + * 2) The file has been deleted on the previous round. + */ + asprintf(&file_name, "%s/%s-%d", opal_tmp_directory(), OPAL_CR_BASE_ENV_NAME, prev_pid); + if(0 != stat(file_name, &file_status) ){ + return OPAL_SUCCESS; + } + +#if OPAL_ENABLE_CRDEBUG == 1 + opal_unsetenv(mca_base_param_env_var("opal_cr_enable_crdebug"), &environ); +#endif + + extract_env_vars(prev_pid, file_name); + +#if OPAL_ENABLE_CRDEBUG == 1 + mca_base_param_reg_int_name("opal_cr", "enable_crdebug", + "Enable checkpoint/restart debugging", + false, false, + 0, + &val); + MPIR_debug_with_checkpoint = OPAL_INT_TO_BOOL(val); + + opal_output_verbose(10, opal_cr_output, + "opal_cr: init: C/R Debugging Enabled [%s] (refresh)\n", + (MPIR_debug_with_checkpoint ? "True": "False")); +#else + val = 0; /* Silence Compiler warning */ +#endif + + if( NULL != file_name ){ + free(file_name); + file_name = NULL; + } + + return OPAL_SUCCESS; +} + /* * Extract environment variables from a saved file * and place them in the environment. */ -static int extract_env_vars(int prev_pid) +static int extract_env_vars(int prev_pid, char * file_name) { int exit_status = OPAL_SUCCESS; - char *file_name = NULL; FILE *env_data = NULL; int len = OPAL_PATH_MAX; char * tmp_str = NULL; @@ -758,12 +940,6 @@ static int extract_env_vars(int prev_pid) goto cleanup; } - /* - * JJH: Hardcode /tmp here, really only need an agreed upon file to - * transfer the environment variables. 
- */ - asprintf(&file_name, "/tmp/%s-%d", OPAL_CR_BASE_ENV_NAME, prev_pid); - if (NULL == (env_data = fopen(file_name, "r")) ) { exit_status = OPAL_ERROR; goto cleanup; @@ -805,17 +981,12 @@ static int extract_env_vars(int prev_pid) tmp_str = NULL; } - cleanup: if( NULL != env_data ) { fclose(env_data); } unlink(file_name); - if( NULL != file_name ){ - free(file_name); - } - if( NULL != tmp_str ){ free(tmp_str); } @@ -871,6 +1042,10 @@ static void* opal_cr_thread_fn(opal_object_t *obj) } } +#if OPAL_ENABLE_CRDEBUG == 1 + opal_cr_debug_free_threads[1] = opal_thread_get_self(); +#endif + /* * Wait to become active */ @@ -1106,3 +1281,129 @@ void opal_cr_display_all_timers(void) opal_output(0, "OPAL CR Timing: ******************** Summary End\n"); } + +#if OPAL_ENABLE_CRDEBUG == 1 +int opal_cr_debug_set_current_ckpt_thread_self(void) +{ + int t; + + if( NULL == opal_cr_debug_free_threads ) { + opal_cr_debug_num_free_threads = 3; + opal_cr_debug_free_threads = (opal_thread_t **)malloc(sizeof(opal_thread_t *) * opal_cr_debug_num_free_threads ); + for(t = 0; t < opal_cr_debug_num_free_threads; ++t ) { + opal_cr_debug_free_threads[t] = NULL; + } + } + + opal_cr_debug_free_threads[0] = opal_thread_get_self(); + + return OPAL_SUCCESS; +} + +int opal_cr_debug_clear_current_ckpt_thread(void) +{ + opal_cr_debug_free_threads[0] = NULL; + + return OPAL_SUCCESS; +} + +int MPIR_checkpoint_debugger_detach(void) { + /* This function is meant to be a noop function for checkpoint/restart + * enabled debugging functionality */ +#if 0 + /* Once the debugger can successfully force threads into the function below, + * then we can uncomment this line */ + if( MPIR_debug_with_checkpoint ) { + opal_cr_debug_threads_already_waiting = true; + } +#endif + return OPAL_SUCCESS; +} + +void MPIR_checkpoint_debugger_signal_handler(int signo) +{ + opal_output_verbose(1, opal_cr_output, + "crs: MPIR_checkpoint_debugger_signal_handler(): Enter Debug signal handler..."); + + 
MPIR_checkpoint_debugger_waitpoint(); + + opal_output_verbose(1, opal_cr_output, + "crs: MPIR_checkpoint_debugger_signal_handler(): Leave Debug signal handler..."); +} + +void *MPIR_checkpoint_debugger_waitpoint(void) +{ + int t; + opal_thread_t *thr = NULL; + + thr = opal_thread_get_self(); + + /* + * Sanity check, if the debugger is not going to attach, then do not wait + * Make sure to open the debug gate, so that threads can get out + */ + if( !MPIR_debug_with_checkpoint ) { + opal_output_verbose(1, opal_cr_output, + "crs: MPIR_checkpoint_debugger_waitpoint(): Debugger is not attaching... (%d)", + (int)thr->t_handle); + MPIR_checkpoint_debug_gate = 1; + return NULL; + } + else { + opal_output_verbose(1, opal_cr_output, + "crs: MPIR_checkpoint_debugger_waitpoint(): Waiting for the Debugger to attach... (%d)", + (int)thr->t_handle); + MPIR_checkpoint_debug_gate = 0; + } + + /* + * Let special threads escape without waiting, they will wait later + */ + for(t = 0; t < opal_cr_debug_num_free_threads; ++t) { + if( opal_cr_debug_free_threads[t] != NULL && + opal_thread_self_compare(opal_cr_debug_free_threads[t]) ) { + opal_output_verbose(1, opal_cr_output, + "crs: MPIR_checkpoint_debugger_waitpoint(): Checkpointing thread does not wait here... (%d)", + (int)thr->t_handle); + return NULL; + } + } + + /* + * Force all other threads into the waiting function, + * unless they are already in there, then just return so we do not nest + * calls into this wait function and potentially confuse the debugger. + */ + if( opal_cr_debug_threads_already_waiting ) { + opal_output_verbose(1, opal_cr_output, + "crs: MPIR_checkpoint_debugger_waitpoint(): Threads are already waiting from debugger detach, do not wait here... (%d)", + (int)thr->t_handle); + return NULL; + } else { + opal_output_verbose(1, opal_cr_output, + "crs: MPIR_checkpoint_debugger_waitpoint(): Wait... 
(%d)", + (int)thr->t_handle); + return MPIR_checkpoint_debugger_breakpoint(); + } +} + +/* + * A tight loop to wait for debugger to release this process from the + * breakpoint. + */ +void *MPIR_checkpoint_debugger_breakpoint(void) +{ + /* spin until debugger attaches and releases us */ + while (MPIR_checkpoint_debug_gate == 0) { +#if defined(__WINDOWS__) + Sleep(100); /* milliseconds */ +#elif defined(HAVE_USLEEP) + usleep(100000); /* microseconds */ +#else + sleep(1); /* seconds */ +#endif + } + opal_cr_debug_threads_already_waiting = false; + return NULL; +} +#endif diff --git a/opal/runtime/opal_cr.h b/opal/runtime/opal_cr.h index 9bcd593679..93dea31ced 100644 --- a/opal/runtime/opal_cr.h +++ b/opal/runtime/opal_cr.h @@ -91,6 +91,44 @@ typedef enum opal_cr_ckpt_cmd_state_t opal_cr_ckpt_cmd_state_t; /* The current state of a checkpoint operation */ OPAL_DECLSPEC extern int opal_cr_checkpointing_state; +#if OPAL_ENABLE_CRDEBUG == 1 + /* Whether or not C/R Debugging is enabled for this process */ + OPAL_DECLSPEC extern int MPIR_debug_with_checkpoint; + + /* + * Set/clear the current thread id for the checkpointing thread + */ + OPAL_DECLSPEC int opal_cr_debug_set_current_ckpt_thread_self(void); + OPAL_DECLSPEC int opal_cr_debug_clear_current_ckpt_thread(void); + + /* + * This MPI Debugger function needs to be accessed here and have a specific + * name. Thus we are breaking the traditional naming conventions to provide this functionality. + */ + OPAL_DECLSPEC int MPIR_checkpoint_debugger_detach(void); + + /** + * A tight loop to wait for debugger to release this process from the + * breakpoint. 
+ */ + OPAL_DECLSPEC void *MPIR_checkpoint_debugger_breakpoint(void); + + /** + * A function for the debugger or CRS to force all threads into + */ + OPAL_DECLSPEC void *MPIR_checkpoint_debugger_waitpoint(void); + + /** + * A signal handler to force all threads to wait when debugger detaches + */ + OPAL_DECLSPEC void MPIR_checkpoint_debugger_signal_handler(int signo); +#endif + + /* + * Refresh environment variables after a restart + */ + OPAL_DECLSPEC int opal_cr_refresh_environ(int prev_pid); + /* * If this is an application that doesn't want to have * a notification callback installed, set this to false. @@ -253,6 +291,42 @@ typedef enum opal_cr_ckpt_cmd_state_t opal_cr_ckpt_cmd_state_t; int *state); OPAL_DECLSPEC int opal_cr_inc_core_recover(int state); + + /******************************* + * User Coordination Routines + *******************************/ + typedef enum { + OMPI_CR_INC_PRE_CRS_PRE_MPI = 0, + OMPI_CR_INC_PRE_CRS_POST_MPI = 1, + OMPI_CR_INC_CRS_PRE_CKPT = 2, + OMPI_CR_INC_CRS_POST_CKPT = 3, + OMPI_CR_INC_POST_CRS_PRE_MPI = 4, + OMPI_CR_INC_POST_CRS_POST_MPI = 5, + OMPI_CR_INC_MAX = 6 + } opal_cr_user_inc_callback_event_t; + + typedef enum { + OMPI_CR_INC_STATE_PREPARE = 0, + OMPI_CR_INC_STATE_CONTINUE = 1, + OMPI_CR_INC_STATE_RESTART = 2, + OMPI_CR_INC_STATE_ERROR = 3 + } opal_cr_user_inc_callback_state_t; + + /** + * User coordination callback routine + */ + typedef int (*opal_cr_user_inc_callback_fn_t)(opal_cr_user_inc_callback_event_t event, + opal_cr_user_inc_callback_state_t state); + + OPAL_DECLSPEC int opal_cr_user_inc_register_callback + (opal_cr_user_inc_callback_event_t event, + opal_cr_user_inc_callback_fn_t function, + opal_cr_user_inc_callback_fn_t *prev_function); + + OPAL_DECLSPEC int trigger_user_inc_callback(opal_cr_user_inc_callback_event_t event, + opal_cr_user_inc_callback_state_t state); + + /******************************* * Coordination Routines *******************************/ diff --git a/opal/runtime/opal_finalize.c 
b/opal/runtime/opal_finalize.c index 98f363123f..b897427e30 100644 --- a/opal/runtime/opal_finalize.c +++ b/opal/runtime/opal_finalize.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. * Copyright (c) 2004-2005 The University of Tennessee and The University @@ -43,6 +43,9 @@ #include "opal/event/event.h" #include "opal/runtime/opal_progress.h" #include "opal/mca/carto/base/base.h" +#if OPAL_ENABLE_FT_CR == 1 +#include "opal/mca/compress/base/base.h" +#endif #include "opal/runtime/opal_cr.h" #include "opal/mca/crs/base/base.h" @@ -112,6 +115,10 @@ opal_finalize(void) /* close the checkpoint and restart service */ opal_cr_finalize(); +#if OPAL_ENABLE_FT_CR == 1 + opal_compress_base_close(); +#endif + opal_progress_finalize(); opal_event_fini(); diff --git a/opal/runtime/opal_init.c b/opal/runtime/opal_init.c index 65905a8438..40ea34ef34 100644 --- a/opal/runtime/opal_init.c +++ b/opal/runtime/opal_init.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. 
* Copyright (c) 2004-2005 The University of Tennessee and The University @@ -40,6 +40,9 @@ #include "opal/mca/memchecker/base/base.h" #include "opal/dss/dss.h" #include "opal/mca/carto/base/base.h" +#if OPAL_ENABLE_FT_CR == 1 +#include "opal/mca/compress/base/base.h" +#endif #include "opal/runtime/opal_cr.h" #include "opal/mca/crs/base/base.h" @@ -425,6 +428,23 @@ opal_init(int* pargc, char*** pargv) /* we want to tick the event library whenever possible */ opal_progress_event_users_increment(); +#if OPAL_ENABLE_FT_CR == 1 + /* + * Initialize the compression framework + * Note: Currently only used in C/R so it has been marked to only + * initialize when C/R is enabled. If other places in the code + * wish to use this framework, it is safe to remove the protection. + */ + if( OPAL_SUCCESS != (ret = opal_compress_base_open()) ) { + error = "opal_compress_base_open() failed"; + goto return_error; + } + if( OPAL_SUCCESS != (ret = opal_compress_base_select()) ) { + error = "opal_compress_base_select() failed"; + goto return_error; + } +#endif + /* * Initalize the checkpoint/restart functionality * Note: Always do this so we can detect if the user diff --git a/opal/tools/opal-restart/help-opal-restart.txt b/opal/tools/opal-restart/help-opal-restart.txt index 19efc2ff19..f0c4ec7738 100644 --- a/opal/tools/opal-restart/help-opal-restart.txt +++ b/opal/tools/opal-restart/help-opal-restart.txt @@ -21,7 +21,7 @@ # This is the US/English help file for Open MPI checkpoint tool # [usage] -opal-restart FILENAME +opal-restart -r FILENAME Open PAL Single Process Restart Tool %s @@ -70,3 +70,10 @@ Error: The restart command failed to properly exec the process per Expected Component: %s Selected Component: %s + +[cache_not_avail] +Warning: Recommended cache directory could not be accessed. Falling back + to the snapshot location. 
+Cache Dir : %s +Snapshot Dir: %s + diff --git a/opal/tools/opal-restart/opal-restart.c b/opal/tools/opal-restart/opal-restart.c index 36a00b6214..3f0d591b65 100644 --- a/opal/tools/opal-restart/opal-restart.c +++ b/opal/tools/opal-restart/opal-restart.c @@ -61,6 +61,7 @@ #include "opal/util/show_help.h" #include "opal/util/output.h" #include "opal/util/opal_environ.h" +#include "opal/util/basename.h" #include "opal/mca/base/base.h" #include "opal/mca/base/mca_base_param.h" @@ -70,14 +71,17 @@ #include "opal/mca/crs/crs.h" #include "opal/mca/crs/base/base.h" +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" + /****************** * Local Functions ******************/ static int initialize(int argc, char *argv[]); static int finalize(void); static int parse_args(int argc, char *argv[]); -static int check_file(char *given_filename); -static int post_env_vars(int prev_pid, char *location); +static int check_file(void); +static int post_env_vars(int prev_pid, opal_crs_base_snapshot_t *snapshot); /***************************************** * Global Vars for Command line Arguments @@ -86,10 +90,13 @@ static char *expected_crs_comp = NULL; typedef struct { bool help; - char *filename; bool verbose; - bool forked; + char *snapshot_ref; char *snapshot_loc; + char *snapshot_metadata; + char *snapshot_cache; + char *snapshot_compress; + char *snapshot_compress_postfix; int output; } opal_restart_globals_t; @@ -108,20 +115,41 @@ opal_cmd_line_init_t cmd_line_opts[] = { &opal_restart_globals.verbose, OPAL_CMD_LINE_TYPE_BOOL, "Be Verbose" }, - { NULL, NULL, NULL, - '\0', NULL, "fork", - 0, - &opal_restart_globals.forked, OPAL_CMD_LINE_TYPE_BOOL, - "Fork off a new process which is the restarted process instead of " - "replacing opal_restart" }, - - { "crs", "base", "snapshot_dir", - 'w', NULL, "where", + { NULL, NULL, NULL, + 'l', NULL, "location", 1, &opal_restart_globals.snapshot_loc, OPAL_CMD_LINE_TYPE_STRING, - "Where to find the checkpoint 
files. In most cases this is automatically " - "detected, however if a custom location was specified to opal-checkpoint " - "then this argument is meant to match it."}, + "Full path to the location of the local snapshot."}, + + { NULL, NULL, NULL, + 'm', NULL, "metadata", + 1, + &opal_restart_globals.snapshot_metadata, OPAL_CMD_LINE_TYPE_STRING, + "Relative path (with respect to --location) to the metadata file."}, + + { NULL, NULL, NULL, + 'r', NULL, "reference", + 1, + &opal_restart_globals.snapshot_ref, OPAL_CMD_LINE_TYPE_STRING, + "Local snapshot reference."}, + + { NULL, NULL, NULL, + 'c', NULL, "cache", + 1, + &opal_restart_globals.snapshot_cache, OPAL_CMD_LINE_TYPE_STRING, + "Possible local cache of the snapshot reference."}, + + { NULL, NULL, NULL, + 'd', NULL, "decompress", + 1, + &opal_restart_globals.snapshot_compress, OPAL_CMD_LINE_TYPE_STRING, + "Decompression component to use."}, + + { NULL, NULL, NULL, + 'p', NULL, "decompress_postfix", + 1, + &opal_restart_globals.snapshot_compress_postfix, OPAL_CMD_LINE_TYPE_STRING, + "Decompression component postfix."}, /* End of list */ { NULL, NULL, NULL, @@ -151,9 +179,9 @@ main(int argc, char *argv[]) /* * Check for existence of the file, or program in the case of self */ - if( OPAL_SUCCESS != (ret = check_file(opal_restart_globals.filename) )) { + if( OPAL_SUCCESS != (ret = check_file() )) { opal_show_help("help-opal-restart.txt", "invalid_filename", true, - opal_restart_globals.filename); + opal_restart_globals.snapshot_ref); exit_status = ret; goto cleanup; } @@ -170,19 +198,35 @@ main(int argc, char *argv[]) * Make sure we are using the correct checkpointer */ if(NULL == expected_crs_comp) { - char * base = NULL; + char * full_metadata_path = NULL; + FILE * metadata = NULL; - base = opal_crs_base_get_snapshot_directory(opal_restart_globals.filename); - if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(base, + asprintf(&full_metadata_path, "%s/%s/%s", + opal_restart_globals.snapshot_loc, + 
opal_restart_globals.snapshot_ref, + opal_restart_globals.snapshot_metadata); + if( NULL == (metadata = fopen(full_metadata_path, "r")) ) { + opal_show_help("help-opal-restart.txt", "invalid_metadata", true, + opal_restart_globals.snapshot_metadata, + full_metadata_path); + exit_status = OPAL_ERROR; + goto cleanup; + } + if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(metadata, &expected_crs_comp, &prev_pid)) ) { opal_show_help("help-opal-restart.txt", "invalid_metadata", true, - opal_crs_base_metadata_filename, base); + opal_restart_globals.snapshot_metadata, + full_metadata_path); exit_status = ret; goto cleanup; } - free(base); + free(full_metadata_path); + full_metadata_path = NULL; + + fclose(metadata); + metadata = NULL; } opal_output_verbose(10, opal_restart_globals.output, @@ -235,21 +279,17 @@ main(int argc, char *argv[]) * Restart in this process ******************************/ opal_output_verbose(10, opal_restart_globals.output, - "Restarting from file (%s)", - opal_restart_globals.filename); - if( opal_restart_globals.forked ) { - opal_output_verbose(10, opal_restart_globals.output, - "\t Forking off a child"); - } else { - opal_output_verbose(10, opal_restart_globals.output, - "\t Exec in self"); - } + "Restarting from file (%s)\n", + opal_restart_globals.snapshot_ref); snapshot = OBJ_NEW(opal_crs_base_snapshot_t); - snapshot->cold_start = true; - snapshot->reference_name = strdup(opal_restart_globals.filename); - snapshot->local_location = opal_crs_base_get_snapshot_directory(snapshot->reference_name); - snapshot->remote_location = strdup(snapshot->local_location); + snapshot->cold_start = true; + asprintf(&(snapshot->snapshot_directory), "%s/%s", + opal_restart_globals.snapshot_loc, + opal_restart_globals.snapshot_ref); + asprintf(&(snapshot->metadata_filename), "%s/%s", + snapshot->snapshot_directory, + opal_restart_globals.snapshot_metadata); /* Since some checkpoint/restart systems don't pass along env vars to the * restarted 
app, we need to take care of that. @@ -257,7 +297,7 @@ main(int argc, char *argv[]) * Included here is the creation of any files or directories that need to be * created before the process is restarted. */ - if(OPAL_SUCCESS != (ret = post_env_vars(prev_pid, snapshot->local_location) ) ) { + if(OPAL_SUCCESS != (ret = post_env_vars(prev_pid, snapshot) ) ) { exit_status = ret; goto cleanup; } @@ -266,27 +306,16 @@ main(int argc, char *argv[]) * Do the actual restart */ ret = opal_crs.crs_restart(snapshot, - opal_restart_globals.forked, + false, &child_pid); if (OPAL_SUCCESS != ret) { opal_show_help("help-opal-restart.txt", "restart_cmd_failure", true, - opal_restart_globals.filename, ret); + opal_restart_globals.snapshot_ref, ret); exit_status = ret; goto cleanup; } - - /* If we required it to exec in self, then fail if this function returns. */ - if(!opal_restart_globals.forked) { - opal_show_help("help-opal-restart.txt", "failed-to-exec", true, - expected_crs_comp, - opal_crs_base_selected_component.base_version.mca_component_name); - exit_status = ret; - goto cleanup; - } - - opal_output_verbose(10, opal_restart_globals.output, - "opal_restart: Restarted Child with PID = %d\n", child_pid); + /* Should never get here, since crs_restart calls exec */ /*************** * Cleanup @@ -320,8 +349,8 @@ static int initialize(int argc, char *argv[]) * Parse Command line arguments */ if (OPAL_SUCCESS != (ret = parse_args(argc, argv))) { - goto cleanup; exit_status = ret; + goto cleanup; } /* @@ -345,6 +374,18 @@ static int initialize(int argc, char *argv[]) free(tmp_env_var); tmp_env_var = NULL; + /* + * Make sure we select the proper compress component. 
+ */ + if( NULL != opal_restart_globals.snapshot_compress ) { + tmp_env_var = mca_base_param_env_var("compress"); + opal_setenv(tmp_env_var, + opal_restart_globals.snapshot_compress, + true, &environ); + free(tmp_env_var); + tmp_env_var = NULL; + } + /* * Initialize the OPAL layer */ @@ -353,6 +394,72 @@ static int initialize(int argc, char *argv[]) goto cleanup; } + /* + * If the checkpoint was compressed, then decompress it before continuing + */ + if( NULL != opal_restart_globals.snapshot_compress ) { + char * zip_dir = NULL; + char * tmp_str = NULL; + + /* Make sure to clear the selection for the restart, + * this way the user can switch compression mechanisms + * across restarts + */ + tmp_env_var = mca_base_param_env_var("compress"); + opal_unsetenv(tmp_env_var, &environ); + free(tmp_env_var); + tmp_env_var = NULL; + + asprintf(&zip_dir, "%s/%s%s", + opal_restart_globals.snapshot_loc, + opal_restart_globals.snapshot_ref, + opal_restart_globals.snapshot_compress_postfix); + + if (0 > (ret = access(zip_dir, F_OK)) ) { + opal_output(opal_restart_globals.output, + "Error: Unable to access the file [%s]!", + zip_dir); + exit_status = OPAL_ERROR; + goto cleanup; + } + + opal_output_verbose(10, opal_restart_globals.output, + "Decompressing (%s)", + zip_dir); + + opal_compress.decompress(zip_dir, &tmp_str); + + if( NULL != zip_dir ) { + free(zip_dir); + zip_dir = NULL; + } + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + } + + /* + * If a cache directory has been suggested, see if it exists + */ + if( NULL != opal_restart_globals.snapshot_cache ) { + if(0 == (ret = access(opal_restart_globals.snapshot_cache, F_OK)) ) { + opal_output_verbose(10, opal_restart_globals.output, + "Using the cached snapshot (%s) instead of (%s)", + opal_restart_globals.snapshot_cache, + opal_restart_globals.snapshot_loc); + if( NULL != opal_restart_globals.snapshot_loc ) { + free(opal_restart_globals.snapshot_loc); + opal_restart_globals.snapshot_loc = NULL; + }
opal_restart_globals.snapshot_loc = opal_dirname(opal_restart_globals.snapshot_cache); + } else { + opal_show_help("help-opal-restart.txt", "cache_not_avail", true, + opal_restart_globals.snapshot_cache, + opal_restart_globals.snapshot_loc); + } + } + /* * Mark this process as a tool */ @@ -380,10 +487,13 @@ static int parse_args(int argc, char *argv[]) char **app_env = NULL, **global_env = NULL; opal_restart_globals.help = false; - opal_restart_globals.filename = NULL; opal_restart_globals.verbose = false; - opal_restart_globals.forked = false; + opal_restart_globals.snapshot_ref = NULL; opal_restart_globals.snapshot_loc = NULL; + opal_restart_globals.snapshot_metadata = NULL; + opal_restart_globals.snapshot_cache = NULL; + opal_restart_globals.snapshot_compress = NULL; + opal_restart_globals.snapshot_compress_postfix = NULL; opal_restart_globals.output = 0; /* Parse the command line options */ @@ -412,8 +522,7 @@ static int parse_args(int argc, char *argv[]) * Now start parsing our specific arguments */ if (OPAL_SUCCESS != ret || - opal_restart_globals.help || - 1 >= argc) { + opal_restart_globals.help ) { char *args = NULL; args = opal_cmd_line_get_usage_msg(&cmd_line); opal_show_help("help-opal-restart.txt", "usage", true, @@ -424,20 +533,11 @@ static int parse_args(int argc, char *argv[]) /* get the remaining bits */ opal_cmd_line_get_tail(&cmd_line, &argc, &argv); - if ( 1 > argc ) { - char *args = NULL; - args = opal_cmd_line_get_usage_msg(&cmd_line); - opal_show_help("help-opal-restart.txt", "usage", true, - args); - free(args); - return OPAL_ERROR; - } - opal_restart_globals.filename = strdup(argv[0]); - if ( NULL == opal_restart_globals.filename || - 0 >= strlen(opal_restart_globals.filename) ) { + if ( NULL == opal_restart_globals.snapshot_ref || + 0 >= strlen(opal_restart_globals.snapshot_ref) ) { opal_show_help("help-opal-restart.txt", "invalid_filename", true, - opal_restart_globals.filename); + opal_restart_globals.snapshot_ref); return OPAL_ERROR; } 
@@ -445,21 +545,20 @@ static int parse_args(int argc, char *argv[]) * need to be grouped together. * Useful in the 'mca crs self' instance. */ - if(argc > 1) { - opal_restart_globals.filename = strdup(opal_argv_join(argv, ' ')); + if(argc > 0) { + opal_restart_globals.snapshot_ref = strdup(opal_argv_join(argv, ' ')); } return OPAL_SUCCESS; } -static int check_file(char *given_filename) +static int check_file(void) { int exit_status = OPAL_SUCCESS; int ret; char * path_to_check = NULL; - char **argv = NULL; - if(NULL == given_filename) { + if(NULL == opal_restart_globals.snapshot_ref) { opal_output(opal_restart_globals.output, "Error: No filename provided!"); exit_status = OPAL_ERROR; @@ -469,9 +568,10 @@ static int check_file(char *given_filename) /* * Check for the existance of the snapshot handle in the snapshot directory */ - path_to_check = opal_crs_base_get_snapshot_directory(given_filename); + asprintf(&path_to_check, "%s/%s", + opal_restart_globals.snapshot_loc, + opal_restart_globals.snapshot_ref); - /* Do the check */ opal_output_verbose(10, opal_restart_globals.output, "Checking for the existence of (%s)", path_to_check); @@ -485,15 +585,15 @@ static int check_file(char *given_filename) } cleanup: - if( NULL != path_to_check) + if( NULL != path_to_check) { free(path_to_check); - if( NULL != argv) - opal_argv_free(argv); + path_to_check = NULL; + } return exit_status; } -static int post_env_vars(int prev_pid, char *location) +static int post_env_vars(int prev_pid, opal_crs_base_snapshot_t *snapshot) { int ret, exit_status = OPAL_SUCCESS; char *command = NULL; @@ -511,11 +611,10 @@ static int post_env_vars(int prev_pid, char *location) } /* - * JJH: Hardcode /tmp to match opal/runtime/opal_cr.c in the application. * This is needed so we can pass the previous environment to the restarted * application process. 
*/ - asprintf(&proc_file, "/tmp/%s-%d", OPAL_CR_BASE_ENV_NAME, prev_pid); + asprintf(&proc_file, "%s/%s-%d", opal_tmp_directory(), OPAL_CR_BASE_ENV_NAME, prev_pid); asprintf(&command, "env | grep OMPI_ > %s", proc_file); opal_output_verbose(5, opal_restart_globals.output, @@ -530,7 +629,14 @@ static int post_env_vars(int prev_pid, char *location) /* * Any directories that need to be created */ - opal_crs_base_metadata_read_token(location, CRS_METADATA_MKDIR, &loc_mkdir); + if( NULL == (snapshot->metadata = fopen(snapshot->metadata_filename, "r")) ) { + opal_show_help("help-opal-restart.txt", "invalid_metadata", true, + opal_restart_globals.snapshot_metadata, + snapshot->metadata_filename); + exit_status = OPAL_ERROR; + goto cleanup; + } + opal_crs_base_metadata_read_token(snapshot->metadata, CRS_METADATA_MKDIR, &loc_mkdir); argc = opal_argv_count(loc_mkdir); for( i = 0; i < argc; ++i ) { if( NULL != command ) { @@ -555,7 +661,7 @@ static int post_env_vars(int prev_pid, char *location) /* * Any files that need to exist */ - opal_crs_base_metadata_read_token(location, CRS_METADATA_TOUCH, &loc_touch); + opal_crs_base_metadata_read_token(snapshot->metadata, CRS_METADATA_TOUCH, &loc_touch); argc = opal_argv_count(loc_touch); for( i = 0; i < argc; ++i ) { if( NULL != command ) { @@ -594,6 +700,11 @@ static int post_env_vars(int prev_pid, char *location) opal_argv_free(loc_touch); loc_touch = NULL; } - + + if( NULL != snapshot->metadata ) { + fclose(snapshot->metadata); + snapshot->metadata = NULL; + } + return exit_status; } diff --git a/orte/config/config_files.m4 b/orte/config/config_files.m4 index a543915750..223768df16 100644 --- a/orte/config/config_files.m4 +++ b/orte/config/config_files.m4 @@ -1,6 +1,9 @@ # -*- shell-script -*- # # Copyright (c) 2009-2010 Cisco Systems, Inc. All rights reserved. +# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. 
# $COPYRIGHT$ # # Additional copyrights may follow @@ -27,6 +30,7 @@ AC_DEFUN([ORTE_CONFIG_FILES],[ orte/tools/orte-clean/Makefile orte/tools/orte-top/Makefile orte/tools/orte-bootproxy/Makefile + orte/tools/orte-migrate/Makefile orte/tools/orte-info/Makefile ]) ]) diff --git a/orte/mca/errmgr/autor/Makefile.am b/orte/mca/errmgr/autor/Makefile.am new file mode 100644 index 0000000000..7b0bfa4823 --- /dev/null +++ b/orte/mca/errmgr/autor/Makefile.am @@ -0,0 +1,38 @@ +# +# Copyright (c) 2009-2010 The Trustees of Indiana University. +# All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +dist_pkgdata_DATA = help-orte-errmgr-autor.txt + +sources = \ + errmgr_autor.h \ + errmgr_autor_component.c \ + errmgr_autor_module.c + +# Make the output library in this directory, and name it either +# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la +# (for static builds). + +if OMPI_BUILD_errmgr_autor_DSO +component_noinst = +component_install = mca_errmgr_autor.la +else +component_noinst = libmca_errmgr_autor.la +component_install = +endif + +mcacomponentdir = $(pkglibdir) +mcacomponent_LTLIBRARIES = $(component_install) +mca_errmgr_autor_la_SOURCES = $(sources) +mca_errmgr_autor_la_LDFLAGS = -module -avoid-version + +noinst_LTLIBRARIES = $(component_noinst) +libmca_errmgr_autor_la_SOURCES = $(sources) +libmca_errmgr_autor_la_LDFLAGS = -module -avoid-version diff --git a/orte/mca/errmgr/autor/configure.m4 b/orte/mca/errmgr/autor/configure.m4 new file mode 100644 index 0000000000..9666c15dd4 --- /dev/null +++ b/orte/mca/errmgr/autor/configure.m4 @@ -0,0 +1,20 @@ +# -*- shell-script -*- +# +# Copyright (c) 2009-2010 The Trustees of Indiana University. +# All rights reserved.
+# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +# MCA_errmgr_autor_CONFIG([action-if-found], [action-if-not-found]) +# ----------------------------------------------------------- +AC_DEFUN([MCA_errmgr_autor_CONFIG],[ + # If we don't want FT, don't compile this component + AS_IF([test "$opal_want_ft_cr" = "1"], + [$1], + [$2]) +])dnl diff --git a/orte/mca/errmgr/autor/configure.params b/orte/mca/errmgr/autor/configure.params new file mode 100644 index 0000000000..df6f06b88a --- /dev/null +++ b/orte/mca/errmgr/autor/configure.params @@ -0,0 +1,14 @@ +# -*- shell-script -*- +# +# Copyright (c) 2009-2010 The Trustees of Indiana University. +# All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +PARAM_INIT_FILE=errmgr_autor_component.c +PARAM_CONFIG_FILES="Makefile" diff --git a/orte/mca/errmgr/autor/errmgr_autor.h b/orte/mca/errmgr/autor/errmgr_autor.h new file mode 100644 index 0000000000..605ea2fbcf --- /dev/null +++ b/orte/mca/errmgr/autor/errmgr_autor.h @@ -0,0 +1,88 @@ +/* + * Copyright (c) 2009-2010 The Trustees of Indiana University. + * All rights reserved. 
+ * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/** + * @file + * + * Automatic Recovery Errmgr component + * + */ + +#ifndef MCA_ERRMGR_AUTOR_EXPORT_H +#define MCA_ERRMGR_AUTOR_EXPORT_H + +#include "orte_config.h" + +#include "opal/mca/mca.h" +#include "opal/event/event.h" + +#include "orte/mca/filem/filem.h" +#include "orte/mca/errmgr/errmgr.h" + +BEGIN_C_DECLS + + /* + * Local Component structures + */ + struct orte_errmgr_autor_component_t { + orte_errmgr_base_component_t super; /** Base Errmgr component */ + bool autor_enabled; + bool timing_enabled; + int recovery_delay; + bool skip_oldnode; + }; + typedef struct orte_errmgr_autor_component_t orte_errmgr_autor_component_t; + OPAL_MODULE_DECLSPEC extern orte_errmgr_autor_component_t mca_errmgr_autor_component; + + int orte_errmgr_autor_component_query(mca_base_module_t **module, int *priority); + + /* + * Module functions: Global + */ + int orte_errmgr_autor_global_module_init(void); + int orte_errmgr_autor_global_module_finalize(void); + + int orte_errmgr_autor_global_update_state(orte_jobid_t job, + orte_job_state_t jobstate, + orte_process_name_t *proc_name, + orte_proc_state_t state, + pid_t pid, + orte_exit_code_t exit_code, + orte_errmgr_stack_state_t *stack_state); + int orte_errmgr_autor_global_process_fault(orte_job_t *jdata, + orte_process_name_t *proc_name, + orte_proc_state_t state, + orte_errmgr_stack_state_t *stack_state); + int orte_errmgr_autor_global_suggest_map_targets(orte_proc_t *proc, + orte_node_t *oldnode, + opal_list_t *node_list, + orte_errmgr_stack_state_t *stack_state); + int orte_errmgr_autor_global_ft_event(int state); + + /* + * Module functions: Local (Daemon) + */ + int orte_errmgr_autor_local_module_init(void); + int orte_errmgr_autor_local_module_finalize(void); + + int orte_errmgr_autor_local_update_state(orte_jobid_t job, + orte_job_state_t jobstate, + orte_process_name_t *proc_name, + orte_proc_state_t state, + pid_t pid, + 
+ orte_exit_code_t exit_code, + orte_errmgr_stack_state_t *stack_state); + int orte_errmgr_autor_local_ft_event(int state); + + +END_C_DECLS + +#endif /* MCA_ERRMGR_AUTOR_EXPORT_H */ diff --git a/orte/mca/errmgr/autor/errmgr_autor_component.c b/orte/mca/errmgr/autor/errmgr_autor_component.c new file mode 100644 index 0000000000..87a901b2fa --- /dev/null +++ b/orte/mca/errmgr/autor/errmgr_autor_component.c @@ -0,0 +1,161 @@ +/* + * Copyright (c) 2009-2010 The Trustees of Indiana University. + * All rights reserved. + * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" +#include "opal/util/output.h" + +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/errmgr/base/base.h" +#include "orte/mca/errmgr/base/errmgr_private.h" +#include "errmgr_autor.h" + +/* + * Public string for version number + */ +const char *orte_errmgr_autor_component_version_string = + "ORTE ERRMGR AutoR MCA component version " ORTE_VERSION; + +/* + * Local functionality + */ +static int errmgr_autor_open(void); +static int errmgr_autor_close(void); + +/* + * Instantiate the public struct with all of our public information + * and pointer to our public functions in it + */ +orte_errmgr_autor_component_t mca_errmgr_autor_component = { + /* First do the base component stuff */ + { + /* Handle the general mca_component_t struct containing + * meta information about the component itself + */ + { + ORTE_ERRMGR_BASE_VERSION_3_0_0, + /* Component name and version */ + "autor", + ORTE_MAJOR_VERSION, + ORTE_MINOR_VERSION, + ORTE_RELEASE_VERSION, + + /* Component open and close functions */ + errmgr_autor_open, + errmgr_autor_close, + orte_errmgr_autor_component_query + }, + { + /* The component is checkpoint ready */ + MCA_BASE_METADATA_PARAM_CHECKPOINT + }, + + /* Verbosity level */ + 0, + /* opal_output handler */ + -1, + /* Default priority */ + 20 + } +}; + +static int errmgr_autor_open(void) +{ + int val; + + /* + * This should be the last
component to ever get used since + * it doesn't do anything. + */ + mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version, + "priority", + "Priority of the ERRMGR autor component", + false, false, + mca_errmgr_autor_component.super.priority, + &mca_errmgr_autor_component.super.priority); + + mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version, + "verbose", + "Verbose level for the ERRMGR autor component", + false, false, + mca_errmgr_autor_component.super.verbose, + &mca_errmgr_autor_component.super.verbose); + /* If there is a custom verbose level for this component then use it + * otherwise take our parent's level and output channel + */ + if ( 0 != mca_errmgr_autor_component.super.verbose) { + mca_errmgr_autor_component.super.output_handle = opal_output_open(NULL); + opal_output_set_verbosity(mca_errmgr_autor_component.super.output_handle, + mca_errmgr_autor_component.super.verbose); + } else { + mca_errmgr_autor_component.super.output_handle = orte_errmgr_base.output; + } + + mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version, + "timing", + "Enable Automatic Recovery timer", + false, false, + 0, &val); + mca_errmgr_autor_component.timing_enabled = OPAL_INT_TO_BOOL(val); + + mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version, + "enable", + "Enable Automatic Recovery (Default: 0/off)", + false, false, + 0, &val); + mca_errmgr_autor_component.autor_enabled = OPAL_INT_TO_BOOL(val); + + mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version, + "recovery_delay", + "Number of seconds to wait before starting to recover the job after a failure" + " [Default: 1 sec]", + false, false, + 1, &val); + mca_errmgr_autor_component.recovery_delay = val; + + mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version, + "skip_oldnode", + "Skip the old node from failed proc, even if it is still available" + " [Default: Enabled]", + false, false, + 1, &val);
mca_errmgr_autor_component.skip_oldnode = OPAL_INT_TO_BOOL(val); + + /* + * Debug Output + */ + opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle, + "errmgr:autor: open()"); + opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle, + "errmgr:autor: open: priority = %d", + mca_errmgr_autor_component.super.priority); + opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle, + "errmgr:autor: open: verbosity = %d", + mca_errmgr_autor_component.super.verbose); + opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle, + "errmgr:autor: open: timing = %s", + (mca_errmgr_autor_component.timing_enabled ? "Enabled" : "Disabled")); + opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle, + "errmgr:autor: open: Auto. Recover = %s", + (mca_errmgr_autor_component.autor_enabled ? "Enabled" : "Disabled")); + opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle, + "errmgr:autor: open: recover_delay = %d", + mca_errmgr_autor_component.recovery_delay); + + return ORTE_SUCCESS; +} + +static int errmgr_autor_close(void) +{ + opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle, + "errmgr:autor: close()"); + + return ORTE_SUCCESS; +} diff --git a/orte/mca/errmgr/autor/errmgr_autor_module.c b/orte/mca/errmgr/autor/errmgr_autor_module.c new file mode 100644 index 0000000000..973df1830c --- /dev/null +++ b/orte/mca/errmgr/autor/errmgr_autor_module.c @@ -0,0 +1,1194 @@ +/* + * Copyright (c) 2009-2010 The Trustees of Indiana University. + * All rights reserved. 
+ * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" + +#include +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif /* HAVE_UNISTD_H */ +#ifdef HAVE_STRING_H +#include <string.h> +#endif + +#include "opal/util/show_help.h" +#include "opal/util/output.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" +#include "opal/util/argv.h" +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" +#include "opal/mca/crs/crs.h" +#include "opal/mca/crs/base/base.h" + +#include "orte/util/error_strings.h" +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "opal/dss/dss.h" +#include "orte/mca/rml/rml.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/routed/routed.h" +#include "orte/mca/iof/iof.h" +#include "orte/mca/plm/plm.h" +#include "orte/mca/plm/base/base.h" +#include "orte/mca/plm/base/plm_private.h" +#include "orte/mca/filem/filem.h" +#include "orte/mca/grpcomm/grpcomm.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/rmaps/rmaps_types.h" +#include "orte/mca/snapc/snapc.h" +#include "orte/mca/snapc/base/base.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/errmgr/base/base.h" +#include "orte/mca/errmgr/base/errmgr_private.h" + +#include "errmgr_autor.h" + +#include MCA_timer_IMPLEMENTATION_HEADER + + +/****************** + * Automatic Recovery module + ******************/ +static orte_errmgr_base_module_t global_module = { + /** Initialization Function */ + orte_errmgr_autor_global_module_init, + /** Finalization Function */ + orte_errmgr_autor_global_module_finalize, + /** Update State */ + orte_errmgr_autor_global_update_state, + NULL, /** predicted_fault */ + /*orte_errmgr_autor_global_process_fault,*/ + orte_errmgr_autor_global_suggest_map_targets, + orte_errmgr_autor_global_ft_event +}; +
+static orte_errmgr_base_module_t local_module = {
+    /** Initialization Function */
+    orte_errmgr_autor_local_module_init,
+    /** Finalization Function */
+    orte_errmgr_autor_local_module_finalize,
+    /** Update State */
+    orte_errmgr_autor_local_update_state,
+    NULL, /** predicted_fault */
+    /*orte_errmgr_autor_local_process_fault,*/
+    NULL, /* suggest_map_targets */
+    orte_errmgr_autor_local_ft_event
+};
+
+/************************
+ * Work Pool structures
+ ************************/
+struct errmgr_autor_wp_item_t {
+    /** List super object */
+    opal_list_item_t super;
+
+    /** ORTE Process name */
+    orte_process_name_t name;
+
+    /** State that was passed with it */
+    orte_proc_state_t state;
+};
+typedef struct errmgr_autor_wp_item_t errmgr_autor_wp_item_t;
+
+OBJ_CLASS_DECLARATION(errmgr_autor_wp_item_t);
+
+void errmgr_autor_wp_item_construct(errmgr_autor_wp_item_t *wp);
+void errmgr_autor_wp_item_destruct(errmgr_autor_wp_item_t *wp);
+
+OBJ_CLASS_INSTANCE(errmgr_autor_wp_item_t,
+                   opal_list_item_t,
+                   errmgr_autor_wp_item_construct,
+                   errmgr_autor_wp_item_destruct);
+
+/************************************
+ * Locally Global vars & functions :)
+ ************************************/
+static orte_jobid_t current_global_jobid = ORTE_JOBID_INVALID;
+static orte_job_t *current_global_jobdata = NULL;
+
+static bool autor_mask_faults = false;
+
+static opal_list_t *procs_pending_recovery = NULL;
+static bool autor_timer_active = false;
+static opal_event_t *autor_timer_event = NULL;
+
+static void errmgr_autor_recover_processes(int fd, short event, void *cbdata);
+static int autor_set_current_job_info(orte_job_t *given_jdata, orte_process_name_t *proc_name);
+
+static int display_procs(void );
+static int autor_procs_sort_compare_fn(opal_list_item_t **a,
+                                       opal_list_item_t **b);
+static void errmgr_autor_process_fault_app(orte_job_t *jdata,
+                                           orte_process_name_t *proc,
+                                           orte_proc_state_t state,
+                                           orte_errmgr_stack_state_t *stack_state);
+static void errmgr_autor_process_fault_daemon(orte_job_t *jdata,
+                                              orte_process_name_t *proc,
+                                              orte_proc_state_t state,
+                                              orte_errmgr_stack_state_t *stack_state);
+static int check_if_terminated(opal_pointer_array_t *procs);
+static int check_if_restarted(opal_pointer_array_t *procs);
+
+static void update_proc(orte_job_t *jdata,
+                        orte_process_name_t *proc,
+                        orte_proc_state_t state,
+                        orte_exit_code_t exit_code);
+
+/*
+ * Timer stuff
+ */
+static void errmgr_autor_set_time(int idx);
+static void errmgr_autor_display_all_timers(void);
+static void errmgr_autor_clear_timers(void);
+
+static double errmgr_autor_get_time(void);
+static void errmgr_autor_display_indv_timer_core(double diff, char *str);
+static double timer_start[OPAL_CR_TIMER_MAX];
+
+#define ERRMGR_AUTOR_TIMER_START    0
+#define ERRMGR_AUTOR_TIMER_SETUP    1
+#define ERRMGR_AUTOR_TIMER_TERM     2
+#define ERRMGR_AUTOR_TIMER_RESETUP  3
+#define ERRMGR_AUTOR_TIMER_RESTART  4
+#define ERRMGR_AUTOR_TIMER_FINISH   5
+#define ERRMGR_AUTOR_TIMER_MAX      6
+
+#define ERRMGR_AUTOR_CLEAR_TIMERS() \
+    { \
+        if(OPAL_UNLIKELY(mca_errmgr_autor_component.timing_enabled > 0)) { \
+            errmgr_autor_clear_timers(); \
+        } \
+    }
+
+#define ERRMGR_AUTOR_SET_TIMER(idx) \
+    { \
+        if(OPAL_UNLIKELY(mca_errmgr_autor_component.timing_enabled > 0)) { \
+            errmgr_autor_set_time(idx); \
+        } \
+    }
+
+#define ERRMGR_AUTOR_DISPLAY_ALL_TIMERS() \
+    { \
+        if(OPAL_UNLIKELY(mca_errmgr_autor_component.timing_enabled > 0)) { \
+            errmgr_autor_display_all_timers(); \
+        } \
+    }
+
+/************************
+ * Function Definitions
+ ************************/
+/*
+ * MCA Functions
+ */
+int orte_errmgr_autor_component_query(mca_base_module_t **module, int *priority)
+{
+    if( !(orte_enable_recovery) ) {
+        opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                            "errmgr:autor:component_query() - Disabled: Recovery is not enabled");
+        *priority = -1;
+        *module = NULL;
+        return ORTE_SUCCESS;
+    }
+
+    if( !mca_errmgr_autor_component.autor_enabled ) {
+        opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                            "errmgr:autor:component_query() - Disabled: C/R Automatic Recovery "
+                            "is not enabled via errmgr_autor_enable MCA parameter.");
+        *priority = -1;
+        *module = NULL;
+        return ORTE_SUCCESS;
+    }
+
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:component_query()");
+
+    *priority = mca_errmgr_autor_component.super.priority;
+    if( ORTE_PROC_IS_HNP ) {
+        *module = (mca_base_module_t *)&global_module;
+    }
+    else if (ORTE_PROC_IS_DAEMON) {
+        *module = (mca_base_module_t *)&local_module;
+    }
+    else {
+        *module = NULL;
+    }
+
+    return ORTE_SUCCESS;
+}
+
+/************************
+ * Function Definitions: Global
+ ************************/
+int orte_errmgr_autor_global_module_init(void)
+{
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:init()");
+
+    procs_pending_recovery = OBJ_NEW(opal_list_t);
+    autor_timer_event = (opal_event_t*)malloc(sizeof(opal_event_t));
+
+    current_global_jobid = ORTE_JOBID_INVALID;
+    current_global_jobdata = NULL;
+
+    ERRMGR_AUTOR_CLEAR_TIMERS();
+
+    return ORTE_SUCCESS;
+}
+
+int orte_errmgr_autor_global_module_finalize(void)
+{
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:finalize()");
+
+    if( NULL != procs_pending_recovery ) {
+        OBJ_RELEASE(procs_pending_recovery);
+        procs_pending_recovery = NULL;
+    }
+    if( NULL != autor_timer_event ) {
+        free(autor_timer_event);
+    }
+
+    current_global_jobid = ORTE_JOBID_INVALID;
+    current_global_jobdata = NULL;
+
+    ERRMGR_AUTOR_CLEAR_TIMERS();
+
+    return ORTE_SUCCESS;
+}
+
+static int autor_set_current_job_info(orte_job_t *given_jdata, orte_process_name_t *proc_name)
+{
+    orte_job_t *jdata = NULL;
+    int i;
+
+    /*
+     * If we already figured it out, then just move ahead
+     */
+    if( NULL != current_global_jobdata ) {
+        if( given_jdata->jobid != ORTE_PROC_MY_NAME->jobid &&
+            given_jdata->jobid != current_global_jobdata->jobid ) {
+            current_global_jobdata = given_jdata;
+            current_global_jobid = given_jdata->jobid;
+        }
+        return ORTE_SUCCESS;
+    }
+
+    /*
+     * If this references the application, and not the daemons
+     */
+    if( given_jdata->jobid != ORTE_PROC_MY_NAME->jobid ) {
+        current_global_jobdata = given_jdata;
+        current_global_jobid = given_jdata->jobid;
+        return ORTE_SUCCESS;
+    }
+
+    /*
+     * Otherwise iterate through the job structure and find the first job.
+     */
+    for(i = 0; i < orte_job_data->size; ++i ) {
+        if (NULL == (jdata = (orte_job_t*)opal_pointer_array_get_item(orte_job_data, i))) {
+            continue;
+        }
+        /* Exclude ourselves */
+        if( jdata->jobid == ORTE_PROC_MY_NAME->jobid ) {
+            continue;
+        }
+        current_global_jobdata = jdata;
+        current_global_jobid = jdata->jobid;
+        break;
+    }
+
+    if( NULL == current_global_jobdata ) {
+        opal_output(0, "errmgr:autor:process_fault() (Global) Error: Cannot find the jdata for the current job.");
+        return ORTE_ERROR;
+    }
+
+    return ORTE_SUCCESS;
+}
+
+int orte_errmgr_autor_global_update_state(orte_jobid_t job,
+                                          orte_job_state_t jobstate,
+                                          orte_process_name_t *proc_name,
+                                          orte_proc_state_t state,
+                                          pid_t pid,
+                                          orte_exit_code_t exit_code,
+                                          orte_errmgr_stack_state_t *stack_state)
+{
+    orte_proc_t *loc_proc = NULL;
+    orte_job_t *jdata = NULL;
+    int ret = ORTE_SUCCESS, exit_status = ORTE_SUCCESS;
+    int32_t i;
+
+    /*
+     * if orte is trying to shutdown, just let it
+     */
+    if (orte_finalizing) {
+        return ORTE_SUCCESS;
+    }
+
+    if( NULL != proc_name &&
+        OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_PROC_MY_NAME, proc_name) ) {
+        OPAL_OUTPUT_VERBOSE((1, orte_errmgr_base.output,
+                             "%s errmgr:autor: Update reported on self (%s), state %s. Skip...",
+                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                             ORTE_NAME_PRINT(proc_name),
+                             orte_proc_state_to_str(state) ));
+        return ORTE_SUCCESS;
+    }
+
+    OPAL_OUTPUT_VERBOSE((1, orte_errmgr_base.output,
+                         "%s errmgr:autor: job %s reported state %s"
+                         " for proc %s state %s exit_code %d (%c)",
+                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                         ORTE_JOBID_PRINT(job),
+                         orte_job_state_to_str(jobstate),
+                         (NULL == proc_name) ? "NULL" : ORTE_NAME_PRINT(proc_name),
+                         orte_proc_state_to_str(state), exit_code,
+                         (orte_finalizing ? 'T' : 'F')));
+
+    /* get the job data object for this process */
+    if (NULL == (jdata = orte_get_job_data_object(job))) {
+        ret = ORTE_ERROR;
+        ORTE_ERROR_LOG(ret);
+        exit_status = ret;
+        goto cleanup;
+    }
+
+    /*
+     * If this job opted not to be recovered, then skip
+     */
+    if( !(jdata->enable_recovery) ) {
+        exit_status = ORTE_SUCCESS;
+        goto cleanup;
+    }
+
+    if( ORTE_JOB_STATE_RESTART == jobstate ) {
+
+        for(i = 0; i < jdata->procs->size; ++i) {
+            if (NULL == (loc_proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, i))) {
+                continue;
+            }
+            break;
+        }
+
+        /*state = ORTE_PROC_STATE_KILLED_BY_CMD;*/
+        if( ORTE_SUCCESS != (ret = orte_errmgr_autor_global_process_fault(jdata, &(loc_proc->name), state, stack_state)) ) {
+            ORTE_ERROR_LOG(ret);
+            exit_status = ret;
+            goto cleanup;
+        }
+    }
+    else if( ORTE_PROC_STATE_ABORTED_BY_SIG == state ||
+             ORTE_PROC_STATE_COMM_FAILED == state ) {
+        if( ORTE_SUCCESS != (ret = orte_errmgr_autor_global_process_fault(jdata, proc_name, state, stack_state)) ) {
+            ORTE_ERROR_LOG(ret);
+            exit_status = ret;
+            goto cleanup;
+        }
+    }
+    else if( ORTE_PROC_STATE_KILLED_BY_CMD == state ) {
+        if( autor_mask_faults ) {
+            update_proc(jdata, proc_name, state, exit_code);
+            *stack_state &= ~ORTE_ERRMGR_STACK_STATE_JOB_ABORT;
+            *stack_state |= ORTE_ERRMGR_STACK_STATE_RECOVERED;
+        }
+    }
+
+ cleanup:
+    return exit_status;
+}
+
+int orte_errmgr_autor_global_process_fault(orte_job_t *jdata,
+                                           orte_process_name_t *proc_name,
+                                           orte_proc_state_t state,
+                                           orte_errmgr_stack_state_t *stack_state)
+{
+    int ret;
+
+    /*
+     * Recover from the process failure by relaunching.
+     */
+    if( ORTE_SUCCESS != (ret = autor_set_current_job_info(jdata, proc_name)) ) {
+        ORTE_ERROR_LOG(ret);
+        return ORTE_SUCCESS; /* JJH: Do this for now. Need to fix the flag for normal shutdown */
+        /*return ret;*/
+    }
+
+    current_global_jobdata->controls |= ORTE_JOB_CONTROL_RECOVERABLE;
+
+    if( proc_name->jobid == ORTE_PROC_MY_NAME->jobid ) {
+        errmgr_autor_process_fault_daemon(jdata, proc_name, state, stack_state);
+    } else {
+        update_proc(jdata, proc_name, state, 0);
+        errmgr_autor_process_fault_app(jdata, proc_name, state, stack_state);
+    }
+
+    return ORTE_SUCCESS;
+}
+
+int orte_errmgr_autor_global_suggest_map_targets(orte_proc_t *proc,
+                                                 orte_node_t *oldnode,
+                                                 opal_list_t *node_list,
+                                                 orte_errmgr_stack_state_t *stack_state)
+{
+    opal_list_item_t *item = NULL;
+    errmgr_autor_wp_item_t *wp_item = NULL;
+    orte_node_t *node = NULL;
+    bool found = false;
+    int num_removed = 0, num_to_remove;
+
+    if( NULL == current_global_jobdata ) {
+        return ORTE_SUCCESS;
+    }
+
+    /* JJH Nasty Hack */
+    num_to_remove = current_global_jobdata->num_procs / 2;
+    num_to_remove += 1;
+
+    /*
+     * Find this process in the known failures list
+     */
+    found = false;
+    if( mca_errmgr_autor_component.skip_oldnode ) {
+        for(item = opal_list_get_first(procs_pending_recovery);
+            item != opal_list_get_end(procs_pending_recovery);
+            item = opal_list_get_next(item) ) {
+            wp_item = (errmgr_autor_wp_item_t*)item;
+
+            if( wp_item->name.vpid == proc->name.vpid &&
+                wp_item->name.jobid == proc->name.jobid ) {
+                found = true;
+                break;
+            }
+        }
+    }
+
+    OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                         "%s errmgr:autor: suggest_map() "
+                         "Process remapping: %s oldnode %s, %s",
+                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                         ORTE_NAME_PRINT(&proc->name),
+                         oldnode->name,
+                         (found ? "Failed Proc." : "Good Proc.") ));
+
+    /*
+     * If not a failed process, then return it to the oldnode
+     * If failed process, do not place it back on the same node
+     */
+    num_removed = 0;
+    for( item = opal_list_get_first(node_list);
+         item != opal_list_get_end(node_list);
+         item = opal_list_get_next(item) ) {
+        node = (orte_node_t*)item;
+        if( found ) {
+            if( num_removed >= num_to_remove ) {
+                break;
+            }
+            /* JJH Nasty Hack */
+#if 0
+            /* Remove oldnode (if more than one node) */
+            if( node == oldnode && 1 < opal_list_get_size(node_list) ) {
+                opal_output(0, "JJH Remove Node (%s)", node->name);
+                opal_list_remove_item(node_list, item);
+                OBJ_RELEASE(item);
+            }
+#else
+            if( 1 < opal_list_get_size(node_list) ) {
+                opal_list_remove_item(node_list, item);
+                OBJ_RELEASE(item);
+            }
+#endif
+            num_removed++;
+        } else {
+            /* Stay on same node */
+            if( node != oldnode ) {
+                opal_list_remove_item(node_list, item);
+                OBJ_RELEASE(item);
+            }
+        }
+    }
+
+    return ORTE_SUCCESS;
+}
+
+int orte_errmgr_autor_global_ft_event(int state)
+{
+    return ORTE_SUCCESS;
+}
+
+/************************
+ * Function Definitions: Local
+ ************************/
+int orte_errmgr_autor_local_module_init(void)
+{
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:init() Local");
+
+    current_global_jobid = ORTE_JOBID_INVALID;
+    current_global_jobdata = NULL;
+
+    return ORTE_SUCCESS;
+}
+
+int orte_errmgr_autor_local_module_finalize(void)
+{
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:finalize() Local");
+
+    current_global_jobid = ORTE_JOBID_INVALID;
+    current_global_jobdata = NULL;
+
+    return ORTE_SUCCESS;
+}
+
+int orte_errmgr_autor_local_update_state(orte_jobid_t job,
+                                         orte_job_state_t jobstate,
+                                         orte_process_name_t *proc_name,
+                                         orte_proc_state_t state,
+                                         pid_t pid,
+                                         orte_exit_code_t exit_code,
+                                         orte_errmgr_stack_state_t *stack_state)
+{
+    /*
+     * If this component is enabled, then the global version takes care of
+     * recovery policy. Tell lower layers in the ErrMgr stack -not- to recover
+     * locally.
+     */
+    *stack_state &= ~ORTE_ERRMGR_STACK_STATE_JOB_ABORT;
+    *stack_state |= ORTE_ERRMGR_STACK_STATE_RECOVERED;
+
+    OPAL_OUTPUT_VERBOSE((1, orte_errmgr_base.output,
+                         "%s errmgr:autor: update_state() (Local) job state %s"
+                         " for proc %s state %s",
+                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                         orte_job_state_to_str(jobstate),
+                         (NULL == proc_name) ? "NULL" : ORTE_NAME_PRINT(proc_name),
+                         orte_proc_state_to_str(state) ));
+
+    return ORTE_SUCCESS;
+}
+
+int orte_errmgr_autor_local_ft_event(int state)
+{
+    return ORTE_SUCCESS;
+}
+
+/*****************
+ * Local Functions
+ *****************/
+static void errmgr_autor_process_fault_app(orte_job_t *jdata,
+                                           orte_process_name_t *proc,
+                                           orte_proc_state_t state,
+                                           orte_errmgr_stack_state_t *stack_state)
+{
+    errmgr_autor_wp_item_t *wp_item = NULL;
+    struct timeval soon;
+
+
+    OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                         "%s errmgr:autor: process_fault() "
+                         "Process fault! proc %s (0x%x)",
+                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                         ORTE_NAME_PRINT(proc),
+                         state));
+
+    if( !orte_sstore_base_is_checkpoint_available ) {
+        OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                             "%s errmgr:autor: process_fault() "
+                             "No checkpoints are available for this job! Cannot Automatically Recover!",
+                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) ));
+        *stack_state |= ORTE_ERRMGR_STACK_STATE_JOB_ABORT;
+        opal_show_help("help-orte-errmgr-autor.txt", "failed_to_recover_proc", true,
+                       ORTE_NAME_PRINT(proc), proc->vpid);
+        return;
+    }
+
+    *stack_state &= ~ORTE_ERRMGR_STACK_STATE_JOB_ABORT;
+    *stack_state |= ORTE_ERRMGR_STACK_STATE_RECOVERED;
+
+    /*
+     * If we are already in the shutdown stage of the recovery, then just skip it
+     */
+    if( autor_mask_faults ) {
+        OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                             "%s errmgr:autor:process_fault() "
+                             "Currently recovering the job. Failure masked!",
+                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+        return;
+    }
+
+    /*
+     * Append this process to the list to process
+     */
+    wp_item = OBJ_NEW(errmgr_autor_wp_item_t);
+    wp_item->name.jobid = proc->jobid;
+    wp_item->name.vpid = proc->vpid;
+    wp_item->state = state;
+
+    opal_list_append(procs_pending_recovery, &(wp_item->super));
+
+    /*
+     * Activate the timer, if it is not already set up
+     */
+    if( !autor_timer_active ) {
+        autor_timer_active = true;
+
+        opal_evtimer_set(autor_timer_event, errmgr_autor_recover_processes, NULL);
+        soon.tv_sec = mca_errmgr_autor_component.recovery_delay;
+        soon.tv_usec = 0;
+        opal_evtimer_add(autor_timer_event, &soon);
+    }
+
+    return;
+}
+
+static void errmgr_autor_process_fault_daemon(orte_job_t *jdata,
+                                              orte_process_name_t *proc,
+                                              orte_proc_state_t state,
+                                              orte_errmgr_stack_state_t *stack_state)
+{
+    orte_proc_t *loc_proc = NULL, *child_proc = NULL;
+    orte_std_cntr_t i_proc;
+    int32_t i;
+
+    OPAL_OUTPUT_VERBOSE((15, mca_errmgr_autor_component.super.output_handle,
+                         "%s errmgr:autor: process_fault_daemon() "
+                         "------- Daemon fault reported! proc %s (0x%x)",
+                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                         ORTE_NAME_PRINT(proc),
+                         state));
+
+    /*
+     * Set the process state in the job data structure
+     */
+    for(i = 0; i < jdata->procs->size; ++i) {
+        if (NULL == (loc_proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, i))) {
+            continue;
+        }
+
+        if( loc_proc->name.vpid != proc->vpid) {
+            continue;
+        }
+
+        loc_proc->state = state;
+
+        break;
+    }
+
+    /*
+     * Remove the route to this process
+     */
+    orte_routed.delete_route(proc);
+
+    /*
+     * If the aborted daemon had active processes on its node, then we should
+     * make sure to signal that all the children are gone.
+     */
+    if( loc_proc->node->num_procs > 0 ) {
+        OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
+                             "%s errmgr:base: stabalize_runtime() "
+                             "------- Daemon lost with the following processes",
+                             ORTE_NAME_PRINT(ORTE_PROC_MY_NAME)));
+
+        for(i_proc = 0; i_proc < opal_pointer_array_get_size(loc_proc->node->procs); ++i_proc) {
+            child_proc = (orte_proc_t*)opal_pointer_array_get_item(loc_proc->node->procs, i_proc);
+            if( NULL == child_proc ) {
+                continue;
+            }
+
+            OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
+                                 "%s errmgr:base: stabalize_runtime() "
+                                 "\t %s [0x%x]",
+                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                                 ORTE_NAME_PRINT(&child_proc->name),
+                                 child_proc->state));
+
+            if( child_proc->last_errmgr_state < child_proc->state ) {
+                child_proc->last_errmgr_state = child_proc->state;
+                orte_errmgr.update_state(child_proc->name.jobid, ORTE_JOB_STATE_COMM_FAILED,
+                                         &(child_proc->name), ORTE_PROC_STATE_COMM_FAILED,
+                                         0, 1);
+                /*orte_errmgr_base_proc_aborted(&child_proc->name, -1);*/
+            }
+        }
+    }
+    return;
+}
+
+void errmgr_autor_wp_item_construct(errmgr_autor_wp_item_t *wp)
+{
+    wp->name.jobid = ORTE_JOBID_INVALID;
+    wp->name.vpid = ORTE_VPID_INVALID;
+
+    wp->state = 0;
+}
+
+void errmgr_autor_wp_item_destruct(errmgr_autor_wp_item_t *wp)
+{
+    wp->name.jobid = ORTE_JOBID_INVALID;
+    wp->name.vpid = ORTE_VPID_INVALID;
+
+    wp->state = 0;
+}
+
+static int display_procs(void )
+{
+    opal_list_item_t *item = NULL;
+    errmgr_autor_wp_item_t *wp_item = NULL;
+    char *proc_str = NULL;
+    char *tmp_str = NULL;
+
+    for(item = opal_list_get_first(procs_pending_recovery);
+        item != opal_list_get_end(procs_pending_recovery);
+        item = opal_list_get_next(item) ) {
+        wp_item = (errmgr_autor_wp_item_t*)item;
+
+        if( NULL == proc_str ) {
+            asprintf(&proc_str, "\t%s Rank %d\n",
+                     ORTE_NAME_PRINT(&(wp_item->name)),
+                     (int)wp_item->name.vpid);
+        } else {
+            free(tmp_str);      /* release the copy from the previous pass */
+            tmp_str = proc_str; /* take ownership instead of strdup()-ing */
+            proc_str = NULL;
+            asprintf(&proc_str, "%s\t%s Rank %d\n",
+                     tmp_str,
+                     ORTE_NAME_PRINT(&(wp_item->name)),
+                     (int)wp_item->name.vpid);
+        }
+    }
+
+    opal_show_help("help-orte-errmgr-autor.txt", "recovering_job", true,
+                   proc_str);
+
+    if( NULL != tmp_str ) {
+        free(tmp_str);
+        tmp_str = NULL;
+    }
+
+    if( NULL != proc_str ) {
+        free(proc_str);
+        proc_str = NULL;
+    }
+
+    return ORTE_SUCCESS;
+}
+
+static int autor_procs_sort_compare_fn(opal_list_item_t **a,
+                                       opal_list_item_t **b)
+{
+    errmgr_autor_wp_item_t *wp_a, *wp_b;
+
+    wp_a = (errmgr_autor_wp_item_t*)(*a);
+    wp_b = (errmgr_autor_wp_item_t*)(*b);
+
+    if( wp_a->name.vpid > wp_b->name.vpid ) {
+        return 1;
+    }
+    else if( wp_a->name.vpid == wp_b->name.vpid ) {
+        return 0;
+    }
+    else {
+        return -1;
+    }
+}
+
+static void errmgr_autor_recover_processes(int fd, short event, void *cbdata)
+{
+    int ret, exit_status = ORTE_SUCCESS;
+    opal_list_item_t *item = NULL;
+    errmgr_autor_wp_item_t *wp_item = NULL;
+    orte_std_cntr_t i_proc;
+    orte_proc_t *proc = NULL;
+    orte_sstore_base_global_snapshot_info_t *snapshot = NULL;
+    char * tmp_str = NULL;
+
+    autor_mask_faults = true;
+    ERRMGR_AUTOR_CLEAR_TIMERS();
+    ERRMGR_AUTOR_SET_TIMER(ERRMGR_AUTOR_TIMER_START);
+
+    /*
+     * Display the processes that are to be recovered
+     */
+    OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                         "%s errmgr:autor:recover() "
+                         "------- Display known failed processes in the job %s -------",
+                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                         ORTE_JOBID_PRINT(current_global_jobdata->jobid)));
+
+    opal_list_sort(procs_pending_recovery, autor_procs_sort_compare_fn);
+    display_procs();
+
+    /*
+     * Find the latest checkpoint
+     */
+    OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                         "%s errmgr:autor:recover() "
+                         "------- Find the latest checkpoint for the job %s -------",
+                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                         ORTE_JOBID_PRINT(current_global_jobdata->jobid)));
+
+    snapshot = OBJ_NEW(orte_sstore_base_global_snapshot_info_t);
+    if( ORTE_SUCCESS != (ret = orte_sstore.request_global_snapshot_data(&orte_sstore_handle_last_stable, snapshot)) ) {
+        ORTE_ERROR_LOG(ret);
+        exit_status = ret;
+        goto cleanup;
+    }
+
+    ERRMGR_AUTOR_SET_TIMER(ERRMGR_AUTOR_TIMER_SETUP);
+
+    /*
+     * Safely terminate the entire job
+     */
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:recover() "
+                        "------- Safely terminate the job %s -------",
+                        ORTE_JOBID_PRINT(current_global_jobdata->jobid));
+
+    for(i_proc = 0; i_proc < opal_pointer_array_get_size(current_global_jobdata->procs); ++i_proc) {
+        proc = (orte_proc_t*)opal_pointer_array_get_item(current_global_jobdata->procs, i_proc);
+        if( NULL == proc ) {
+            continue;
+        }
+        if( proc->state < ORTE_PROC_STATE_UNTERMINATED ) {
+            proc->state = ORTE_PROC_STATE_MIGRATING;
+        }
+        if( current_global_jobdata->stdin_target == proc->name.vpid ) {
+            orte_iof.close(&(proc->name), ORTE_IOF_STDIN);
+        }
+    }
+
+    orte_plm.terminate_procs(current_global_jobdata->procs);
+
+    /*
+     * Wait for the job to terminate all processes
+     */
+    while(!check_if_terminated(current_global_jobdata->procs) ) {
+        opal_progress();
+    }
+
+    ERRMGR_AUTOR_SET_TIMER(ERRMGR_AUTOR_TIMER_TERM);
+
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:recover() "
+                        "------- Done waiting for termination of job %s -------",
+                        ORTE_JOBID_PRINT(current_global_jobdata->jobid));
+    current_global_jobdata->num_terminated = current_global_jobdata->num_procs;
+    orte_plm_base_reset_job(current_global_jobdata);
+
+    /*
+     * Construct the app contexts to restart
+     */
+    OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                         "%s errmgr:autor:recover() "
+                         "------- Rebuild job %s app context -------",
+                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+                         ORTE_JOBID_PRINT(current_global_jobdata->jobid)));
+    for(i_proc = 0; i_proc < opal_pointer_array_get_size(current_global_jobdata->procs); ++i_proc) {
+        proc = (orte_proc_t*)opal_pointer_array_get_item(current_global_jobdata->procs, i_proc);
+        if( NULL == proc ) {
+            continue;
+        }
+
+        if( ORTE_SUCCESS != (ret = orte_errmgr_base_update_app_context_for_cr_recovery(current_global_jobdata,
+                                                                                       proc,
+                                                                                       &(snapshot->local_snapshots))) ) {
+            ORTE_ERROR_LOG(ret);
+            exit_status = ret;
+            goto cleanup;
+        }
+
+        OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                             "\tAdjusted: \"%s\" [0x%x] [%s]\n",
+                             ORTE_NAME_PRINT(&proc->name), proc->state, proc->node->name));
+    }
+
+    ERRMGR_AUTOR_SET_TIMER(ERRMGR_AUTOR_TIMER_RESETUP);
+
+    /*
+     * Spawn the restarted job
+     */
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:recover() "
+                        "------- Respawning the job %s -------",
+                        ORTE_JOBID_PRINT(current_global_jobdata->jobid));
+    orte_snapc_base_has_recovered = false;
+    autor_mask_faults = false; /* Failures past this point are worth noting */
+    orte_plm.spawn(current_global_jobdata);
+
+    /*
+     * Wait for all the processes to restart
+     */
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:recover() "
+                        "------- Waiting for restart -------");
+    while(!check_if_restarted(current_global_jobdata->procs) ) {
+        opal_progress();
+    }
+
+    ERRMGR_AUTOR_SET_TIMER(ERRMGR_AUTOR_TIMER_RESTART);
+
+    /*
+     * All done
+     */
+    while( !orte_snapc_base_has_recovered ) {
+        opal_progress();
+    }
+
+    opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
+                        "errmgr:autor:recover() "
+                        "------- Finished recovering job %s -------",
+                        ORTE_JOBID_PRINT(current_global_jobdata->jobid));
+
+    opal_show_help("help-orte-errmgr-autor.txt", "recovery_complete", true);
+
+    ERRMGR_AUTOR_SET_TIMER(ERRMGR_AUTOR_TIMER_FINISH);
+
+ cleanup:
+    while(NULL != (item = opal_list_remove_first(procs_pending_recovery))) {
+        wp_item = (errmgr_autor_wp_item_t*)item;
+        OBJ_RELEASE(wp_item);
+    }
+
+    if( NULL != tmp_str ) {
+        free(tmp_str);
+        tmp_str = NULL;
+    }
+
+    ERRMGR_AUTOR_DISPLAY_ALL_TIMERS();
+
+    autor_timer_active = false;
+    autor_mask_faults = false;
+
+    return;
+}
+
+static int check_if_terminated(opal_pointer_array_t *procs)
+{
+    orte_std_cntr_t i_proc;
+    orte_proc_t *proc = NULL;
+    bool is_done;
+
+    if( NULL == procs ){
+        return true;
+    }
+
+    is_done = true;
+    for(i_proc = 0; i_proc < opal_pointer_array_get_size(procs); ++i_proc) {
+        proc = (orte_proc_t*)opal_pointer_array_get_item(procs, i_proc);
+        if( NULL == proc ) {
+            continue;
+        }
+
+        if( proc->state < ORTE_PROC_STATE_UNTERMINATED ||
+            proc->state == ORTE_PROC_STATE_MIGRATING ) {
+            is_done = false;
+            break;
+        }
+    }
+
+    if( !is_done ) {
+        OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                             "\t Still waiting for termination: \"%s\" [0x%x] < [0x%x]\n",
+                             ORTE_NAME_PRINT(&proc->name), proc->state, ORTE_PROC_STATE_UNTERMINATED));
+    }
+
+    return is_done;
+}
+
+static int check_if_restarted(opal_pointer_array_t *procs)
+{
+    orte_std_cntr_t i_proc;
+    orte_proc_t *proc = NULL;
+    bool is_done;
+
+    if( NULL == procs ){
+        return true;
+    }
+
+    is_done = true;
+    for(i_proc = 0; i_proc < opal_pointer_array_get_size(procs); ++i_proc) {
+        proc = (orte_proc_t*)opal_pointer_array_get_item(procs, i_proc);
+        if( NULL == proc ) {
+            continue;
+        }
+
+        if( !(ORTE_PROC_STATE_RUNNING & proc->state) ) {
+            is_done = false;
+            break;
+        }
+    }
+
+    if( !is_done ) {
+        OPAL_OUTPUT_VERBOSE((10, mca_errmgr_autor_component.super.output_handle,
+                             "\t Still waiting for restart: \"%s\" [0x%x] != [0x%x]\n",
+                             ORTE_NAME_PRINT(&proc->name), proc->state, ORTE_PROC_STATE_RUNNING));
+    }
+
+    return is_done;
+}
+
+static void update_proc(orte_job_t *jdata,
+                        orte_process_name_t *proc,
+                        orte_proc_state_t state,
+                        orte_exit_code_t exit_code)
+{
+    opal_list_item_t *item, *next;
+    orte_odls_child_t *child;
+    orte_proc_t *proct;
+    int i;
+
+    /*** UPDATE LOCAL CHILD ***/
+    for (item = opal_list_get_first(&orte_local_children);
+         item != opal_list_get_end(&orte_local_children);
+         item = next) {
+        next = opal_list_get_next(item);
+        child = (orte_odls_child_t*)item;
+        if (child->name->jobid == proc->jobid) {
+            if (child->name->vpid == proc->vpid) {
+                child->state = state;
+                child->exit_code = exit_code;
+                proct = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, child->name->vpid);
+                proct->state = state;
+                proct->exit_code = exit_code;
+                /* (JJH: See note below)
+                if (ORTE_PROC_STATE_UNTERMINATED < state) {
+                    jdata->num_terminated++;
+                }
+                */
+                return;
+            }
+        }
+    }
+
+    /*** UPDATE REMOTE CHILD ***/
+    for (i=0; i < jdata->procs->size; i++) {
+        if (NULL == (proct = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, i))) {
+            continue;
+        }
+        if (proct->name.jobid != proc->jobid ||
+            proct->name.vpid != proc->vpid) {
+            continue;
+        }
+        proct->state = state;
+        proct->exit_code = exit_code;
+        if (ORTE_PROC_STATE_UNTERMINATED < state) {
+            /* JJH: Do not increment this value. Otherwise the 'hnp' component
+             * will try to terminate us after we request the job to
+             * terminate. So we fake it out by making sure that
+             * num_terminated never equals num_procs.
+             * There should be a better way though...
+             */
+            /* update the counter so we can terminate */
+            /*jdata->num_terminated++;*/
+        }
+        return;
+    }
+}
+
+/************************
+ * Timing
+ ************************/
+static void errmgr_autor_set_time(int idx)
+{
+    if(idx < ERRMGR_AUTOR_TIMER_MAX ) {
+        if( timer_start[idx] <= 0.0 ) {
+            timer_start[idx] = errmgr_autor_get_time();
+        }
+    }
+}
+
+static void errmgr_autor_display_all_timers(void)
+{
+    double diff = 0.0;
+    char * label = NULL;
+
+    opal_output(0, "Auto. Recovery Timing: ******************** Summary Begin\n");
+
+    /********** Structure Setup **********/
+    label = strdup("Setup");
+    diff = timer_start[ERRMGR_AUTOR_TIMER_SETUP] - timer_start[ERRMGR_AUTOR_TIMER_START];
+    errmgr_autor_display_indv_timer_core(diff, label);
+    free(label);
+
+    /********** Termination **********/
+    label = strdup("Terminate");
+    diff = timer_start[ERRMGR_AUTOR_TIMER_TERM] - timer_start[ERRMGR_AUTOR_TIMER_SETUP];
+    errmgr_autor_display_indv_timer_core(diff, label);
+    free(label);
+
+    /********** Setup new job **********/
+    label = strdup("Setup Relaunch");
+    diff = timer_start[ERRMGR_AUTOR_TIMER_RESETUP] - timer_start[ERRMGR_AUTOR_TIMER_TERM];
+    errmgr_autor_display_indv_timer_core(diff, label);
+    free(label);
+
+    /********** Restart **********/
+    label = strdup("Restart");
+    diff = timer_start[ERRMGR_AUTOR_TIMER_RESTART] - timer_start[ERRMGR_AUTOR_TIMER_RESETUP];
+    errmgr_autor_display_indv_timer_core(diff, label);
+    free(label);
+
+    /********** Finish **********/
+    label = strdup("Finalize");
+    diff = timer_start[ERRMGR_AUTOR_TIMER_FINISH] - timer_start[ERRMGR_AUTOR_TIMER_RESTART];
+    errmgr_autor_display_indv_timer_core(diff, label);
+    free(label);
+
+    opal_output(0, "Auto. Recovery Timing: ******************** Summary End\n");
+}
+
+static void errmgr_autor_clear_timers(void)
+{
+    int i;
+    for(i = 0; i < ERRMGR_AUTOR_TIMER_MAX; ++i) {
+        timer_start[i] = 0.0;
+    }
+}
+
+static double errmgr_autor_get_time(void)
+{
+    double wtime;
+
+#if OPAL_TIMER_USEC_NATIVE
+    wtime = (double)opal_timer_base_get_usec() / 1000000.0;
+#else
+    struct timeval tv;
+    gettimeofday(&tv, NULL);
+    wtime = tv.tv_sec;
+    wtime += (double)tv.tv_usec / 1000000.0;
+#endif
+
+    return wtime;
+}
+
+static void errmgr_autor_display_indv_timer_core(double diff, char *str)
+{
+    double total = 0;
+    double perc = 0;
+
+    total = timer_start[ERRMGR_AUTOR_TIMER_MAX-1] - timer_start[ERRMGR_AUTOR_TIMER_START];
+    perc = (diff/total) * 100;
+
+    opal_output(0,
+                "errmgr_autor: timing: %-20s = %10.2f s\t%10.2f s\t%6.2f\n",
+                str,
+                diff,
+                total,
+                perc);
+    return;
+}
diff --git a/orte/mca/errmgr/autor/help-orte-errmgr-autor.txt b/orte/mca/errmgr/autor/help-orte-errmgr-autor.txt
new file mode 100644
index 0000000000..9beec33b4d
--- /dev/null
+++ b/orte/mca/errmgr/autor/help-orte-errmgr-autor.txt
@@ -0,0 +1,28 @@
+ -*- text -*-
+#
+# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
+#                         University Research and Technology
+#                         Corporation.  All rights reserved.
+#
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+# This is the US/English general help file for the ORTE ErrMgr AutoR component.
+#
+[recovering_job]
+Notice: The processes listed below failed unexpectedly.
+        Using the last checkpoint to recover the job.
+        Please stand by.
+%s
+[recovery_complete]
+Notice: The job has been successfully recovered from the
+        last checkpoint.
+[failed_to_recover_proc]
+Error: The process below has failed. There is no checkpoint available for
+       this job, so we are terminating the application since automatic
+       recovery cannot occur.
+Internal Name: %s
+MCW Rank: %d
diff --git a/orte/mca/errmgr/base/Makefile.am b/orte/mca/errmgr/base/Makefile.am
index 7a8dbabb5b..50fad03e5c 100644
--- a/orte/mca/errmgr/base/Makefile.am
+++ b/orte/mca/errmgr/base/Makefile.am
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
+# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 #                         University Research and Technology
 #                         Corporation.  All rights reserved.
 # Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -24,4 +24,5 @@ libmca_errmgr_la_SOURCES += \
         base/errmgr_base_close.c \
         base/errmgr_base_select.c \
         base/errmgr_base_open.c \
-        base/errmgr_base_fns.c
+        base/errmgr_base_fns.c \
+        base/errmgr_base_tool.c
diff --git a/orte/mca/errmgr/base/base.h b/orte/mca/errmgr/base/base.h
index f5f25a29c0..bcf993ed89 100644
--- a/orte/mca/errmgr/base/base.h
+++ b/orte/mca/errmgr/base/base.h
@@ -30,6 +30,7 @@
 
 #include "opal/class/opal_list.h"
 #include "opal/mca/mca.h"
+#include "orte/mca/snapc/base/base.h"
 
 #include "orte/mca/errmgr/errmgr.h"
 
@@ -56,6 +57,51 @@ ORTE_DECLSPEC int orte_errmgr_base_close(void);
  */
 ORTE_DECLSPEC extern opal_list_t orte_errmgr_base_components_available;
 
+/**
+ * Interfaces for orte-migrate tool
+ */
+#if OPAL_ENABLE_FT_CR
+/**
+ * Migrating States
+ */
+#define ORTE_ERRMGR_MIGRATE_STATE_ERROR (ORTE_SNAPC_CKPT_MAX + 1)
+#define ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS (ORTE_SNAPC_CKPT_MAX + 2)
+#define ORTE_ERRMGR_MIGRATE_STATE_NONE (ORTE_SNAPC_CKPT_MAX + 3)
+#define ORTE_ERRMGR_MIGRATE_STATE_REQUEST (ORTE_SNAPC_CKPT_MAX + 4)
+#define ORTE_ERRMGR_MIGRATE_STATE_RUNNING (ORTE_SNAPC_CKPT_MAX + 5)
+#define ORTE_ERRMGR_MIGRATE_STATE_RUN_CKPT (ORTE_SNAPC_CKPT_MAX + 6)
+#define ORTE_ERRMGR_MIGRATE_STATE_STARTUP (ORTE_SNAPC_CKPT_MAX + 7)
+#define ORTE_ERRMGR_MIGRATE_STATE_FINISH (ORTE_SNAPC_CKPT_MAX + 8)
+#define ORTE_ERRMGR_MIGRATE_MAX (ORTE_SNAPC_CKPT_MAX + 9)
+
+/*
+ * Commands for command line tool and ErrMgr interaction
+ */
+typedef uint8_t orte_errmgr_tool_cmd_flag_t;
+#define ORTE_ERRMGR_MIGRATE_TOOL_CMD OPAL_UINT8
+#define ORTE_ERRMGR_MIGRATE_TOOL_INIT_CMD 1
+#define ORTE_ERRMGR_MIGRATE_TOOL_UPDATE_CMD 2
+
+/* Initialize/Finalize the orte-migrate communication functionality */
+ORTE_DECLSPEC int orte_errmgr_base_tool_init(void);
+ORTE_DECLSPEC int orte_errmgr_base_tool_finalize(void);
+
+ORTE_DECLSPEC int orte_errmgr_base_migrate_state_str(char ** state_str, int state);
+
+ORTE_DECLSPEC int orte_errmgr_base_migrate_update(int status);
+
+/*
+ * Interfaces for C/R related recovery
+ */
+ORTE_DECLSPEC int orte_errmgr_base_update_app_context_for_cr_recovery(orte_job_t *jobdata,
+                                                                      orte_proc_t *proc,
+                                                                      opal_list_t *local_snapshots);
+
+ORTE_DECLSPEC int orte_errmgr_base_restart_job(orte_jobid_t jobid, char * global_handle, int seq_num);
+ORTE_DECLSPEC int orte_errmgr_base_migrate_job(orte_jobid_t jobid, orte_snapc_base_request_op_t *datum);
+
+#endif
+
 /*
  * Additional External API function declared in errmgr.h
  */
diff --git a/orte/mca/errmgr/base/errmgr_base_fns.c b/orte/mca/errmgr/base/errmgr_base_fns.c
index fd3e3db1ca..e206554394 100644
--- a/orte/mca/errmgr/base/errmgr_base_fns.c
+++ b/orte/mca/errmgr/base/errmgr_base_fns.c
@@ -21,27 +21,157 @@

 #include "orte_config.h"
 #include "orte/constants.h"
+#ifdef HAVE_STRING_H
+#include <string.h>
+#endif
+#if HAVE_SYS_TYPES_H
+#include <sys/types.h>
+#endif /* HAVE_SYS_TYPES_H */
 #ifdef HAVE_UNISTD_H
 #include <unistd.h>
-#endif
+#endif /* HAVE_UNISTD_H */
+#if HAVE_SYS_TYPES_H
+#include <sys/types.h>
+#endif /* HAVE_SYS_TYPES_H */
+#if HAVE_SYS_STAT_H
+#include <sys/stat.h>
+#endif /* HAVE_SYS_STAT_H */
+#ifdef HAVE_DIRENT_H
+#include <dirent.h>
+#endif /* HAVE_DIRENT_H */
+#include
+
 #include
 #include

+#include "opal/mca/mca.h"
+#include "opal/mca/base/base.h"
+#include "opal/mca/base/mca_base_param.h"
 #include "opal/util/trace.h"
+#include "opal/util/os_dirpath.h"
 #include "opal/util/output.h"
+#include "opal/util/basename.h"
+#include "opal/util/argv.h"
+#include "opal/mca/crs/crs.h"
+#include "opal/mca/crs/base/base.h"
 #include "opal/util/opal_sos.h"

 #include "orte/util/name_fns.h"
 #include "orte/util/session_dir.h"
+
+#include "orte/runtime/orte_globals.h"
+#include "orte/runtime/runtime.h"
+#include "orte/runtime/orte_wait.h"
+#include "orte/runtime/orte_locks.h"
+
 #include "orte/mca/ess/ess.h"
 #include "orte/mca/odls/odls.h"
+#include "orte/mca/plm/plm.h"
+#include "orte/mca/rml/rml.h"
+#include "orte/mca/rml/rml_types.h"
 #include "orte/mca/routed/routed.h"
-#include "orte/runtime/orte_globals.h"
+#include "orte/mca/snapc/snapc.h"
+#include "orte/mca/snapc/base/base.h"
+#include "orte/mca/sstore/sstore.h"
+#include "orte/mca/sstore/base/base.h"

 #include "orte/mca/errmgr/errmgr.h"
 #include "orte/mca/errmgr/base/base.h"
 #include "orte/mca/errmgr/base/errmgr_private.h"

+/*
+ * Object stuff
+ */
+void orte_errmgr_predicted_proc_construct(orte_errmgr_predicted_proc_t *item);
+void orte_errmgr_predicted_proc_destruct( orte_errmgr_predicted_proc_t *item);
+
+OBJ_CLASS_INSTANCE(orte_errmgr_predicted_proc_t,
+                   opal_list_item_t,
+                   orte_errmgr_predicted_proc_construct,
+                   orte_errmgr_predicted_proc_destruct);
+
+void orte_errmgr_predicted_proc_construct(orte_errmgr_predicted_proc_t *item)
+{
+    item->proc_name.vpid = ORTE_VPID_INVALID;
+    item->proc_name.jobid = ORTE_JOBID_INVALID;
+}
+
+void orte_errmgr_predicted_proc_destruct( orte_errmgr_predicted_proc_t *item)
+{
+    item->proc_name.vpid = ORTE_VPID_INVALID;
+    item->proc_name.jobid = ORTE_JOBID_INVALID;
+}
+
+void orte_errmgr_predicted_node_construct(orte_errmgr_predicted_node_t *item);
+void orte_errmgr_predicted_node_destruct( orte_errmgr_predicted_node_t *item);
+
+OBJ_CLASS_INSTANCE(orte_errmgr_predicted_node_t,
+                   opal_list_item_t,
+                   orte_errmgr_predicted_node_construct,
+                   orte_errmgr_predicted_node_destruct);
+
+void orte_errmgr_predicted_node_construct(orte_errmgr_predicted_node_t *item)
+{
+    item->node_name = NULL;
+}
+
+void orte_errmgr_predicted_node_destruct( orte_errmgr_predicted_node_t *item)
+{
+    if( NULL != item->node_name ) {
+        free(item->node_name);
+        item->node_name = NULL;
+    }
+}
+
+void orte_errmgr_predicted_map_construct(orte_errmgr_predicted_map_t *item);
+void orte_errmgr_predicted_map_destruct( orte_errmgr_predicted_map_t *item);
+
+OBJ_CLASS_INSTANCE(orte_errmgr_predicted_map_t,
+                   opal_list_item_t,
+                   orte_errmgr_predicted_map_construct,
+                   orte_errmgr_predicted_map_destruct);
+
+void orte_errmgr_predicted_map_construct(orte_errmgr_predicted_map_t *item)
+{
+    item->proc_name.vpid = ORTE_VPID_INVALID;
+    item->proc_name.jobid = ORTE_JOBID_INVALID;
+
+    item->node_name = NULL;
+
+    item->map_proc_name.vpid = ORTE_VPID_INVALID;
+    item->map_proc_name.jobid = ORTE_JOBID_INVALID;
+
+    item->map_node_name = NULL;
+    item->off_current_node = false;
+    item->pre_map_fixed_node = NULL;
+}
+
+void orte_errmgr_predicted_map_destruct( orte_errmgr_predicted_map_t *item)
+{
+    item->proc_name.vpid = ORTE_VPID_INVALID;
+    item->proc_name.jobid = ORTE_JOBID_INVALID;
+
+    if( NULL != item->node_name ) {
+        free(item->node_name);
+        item->node_name = NULL;
+    }
+
+    item->map_proc_name.vpid = ORTE_VPID_INVALID;
+    item->map_proc_name.jobid = ORTE_JOBID_INVALID;
+
+    if( NULL != item->map_node_name ) {
+        free(item->map_node_name);
+        item->map_node_name = NULL;
+    }
+
+    item->off_current_node = false;
+
+    if( NULL != item->pre_map_fixed_node ) {
+        free(item->pre_map_fixed_node);
+        item->pre_map_fixed_node = NULL;
+    }
+}
+
 /*
  * Public interfaces
  */
@@ -135,9 +265,9 @@ int orte_errmgr_base_abort(int error_code, char *fmt, ...)
     return ORTE_SUCCESS;
 }

-int orte_errmgr_base_predicted_fault(char ***proc_list,
-                                     char ***node_list,
-                                     char ***suggested_nodes)
+int orte_errmgr_base_predicted_fault(opal_list_t *proc_list,
+                                     opal_list_t *node_list,
+                                     opal_list_t *suggested_map)
 {
     orte_errmgr_base_module_t *module = NULL;
     int i, rc;
@@ -155,7 +285,7 @@ int orte_errmgr_base_predicted_fault(char ***proc_list,
             continue;
         }
         if( NULL != module->predicted_fault ) {
-            rc = module->predicted_fault(proc_list, node_list, suggested_nodes, &stack_state);
+            rc = module->predicted_fault(proc_list, node_list, suggested_map, &stack_state);
             if (ORTE_SUCCESS != rc || ORTE_ERRMGR_STACK_STATE_COMPLETE & stack_state) {
                 break;
             }
@@ -218,3 +348,348 @@ int orte_errmgr_base_ft_event(int state)

     return ORTE_SUCCESS;
 }
+
+/********************
+ * Utility functions
+ ********************/
+#if OPAL_ENABLE_FT_CR
+int orte_errmgr_base_migrate_state_str(char ** state_str, int state)
+{
+    switch(state) {
+    case ORTE_ERRMGR_MIGRATE_STATE_NONE:
+        *state_str = strdup(" -- ");
+        break;
+    case ORTE_ERRMGR_MIGRATE_STATE_REQUEST:
+        *state_str = strdup("Requested");
+        break;
+    case ORTE_ERRMGR_MIGRATE_STATE_RUNNING:
+        *state_str = strdup("Running");
+        break;
+    case ORTE_ERRMGR_MIGRATE_STATE_RUN_CKPT:
+        *state_str = strdup("Checkpointing");
+        break;
+    case ORTE_ERRMGR_MIGRATE_STATE_STARTUP:
+        *state_str = strdup("Restarting");
+        break;
+    case ORTE_ERRMGR_MIGRATE_STATE_FINISH:
+        *state_str = strdup("Finished");
+        break;
+    case ORTE_ERRMGR_MIGRATE_STATE_ERROR:
+        *state_str = strdup("Error");
+        break;
+    case ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS:
+        *state_str = strdup("Error: Migration in progress");
+        break;
+    default:
+        asprintf(state_str, "Unknown %d", state);
+        break;
+    }
+
+    return ORTE_SUCCESS;
+}
+#endif
+
+#if OPAL_ENABLE_FT_CR
+int orte_errmgr_base_update_app_context_for_cr_recovery(orte_job_t *jobdata,
+                                                        orte_proc_t *proc,
+                                                        opal_list_t *local_snapshots)
+{
+    int ret, exit_status = ORTE_SUCCESS;
+    opal_list_item_t *item = NULL;
+    orte_std_cntr_t i_app;
+    int argc = 0;
+    orte_app_context_t *cur_app_context = NULL;
+    orte_app_context_t *new_app_context = NULL;
+    orte_sstore_base_local_snapshot_info_t *vpid_snapshot = NULL;
+    char *reference_fmt_str = NULL;
+    char *location_str = NULL;
+    char *cache_location_str = NULL;
+    char *ref_location_fmt_str = NULL;
+    char *tmp_str = NULL;
+    char *global_snapshot_ref = NULL;
+    char *global_snapshot_seq = NULL;
+
+    /*
+     * Get the snapshot restart command for this process
+     * JJH CLEANUP: Pass in the vpid_snapshot, so we don't have to look it up every time?
+     */
+    for(item = opal_list_get_first(local_snapshots);
+        item != opal_list_get_end(local_snapshots);
+        item = opal_list_get_next(item) ) {
+        vpid_snapshot = (orte_sstore_base_local_snapshot_info_t*)item;
+        if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL,
+                                                       &vpid_snapshot->process_name,
+                                                       &proc->name) ) {
+            break;
+        }
+        else {
+            vpid_snapshot = NULL;
+        }
+    }
+
+    if( NULL == vpid_snapshot ) {
+        ORTE_ERROR_LOG(ORTE_ERROR);
+        exit_status = ORTE_ERROR;
+        goto cleanup;
+    }
+
+    orte_sstore.get_attr(vpid_snapshot->ss_handle,
+                         SSTORE_METADATA_LOCAL_SNAP_REF_FMT,
+                         &reference_fmt_str);
+    orte_sstore.get_attr(vpid_snapshot->ss_handle,
+                         SSTORE_METADATA_LOCAL_SNAP_LOC,
+                         &location_str);
+    orte_sstore.get_attr(vpid_snapshot->ss_handle,
+                         SSTORE_METADATA_LOCAL_SNAP_REF_LOC_FMT,
+                         &ref_location_fmt_str);
+    orte_sstore.get_attr(vpid_snapshot->ss_handle,
+                         SSTORE_METADATA_GLOBAL_SNAP_REF,
+                         &global_snapshot_ref);
+    orte_sstore.get_attr(vpid_snapshot->ss_handle,
+                         SSTORE_METADATA_GLOBAL_SNAP_SEQ,
+                         &global_snapshot_seq);
+
+    /*
+     * Find current app_context
+     */
+    cur_app_context = NULL;
+    for(i_app = 0; i_app < opal_pointer_array_get_size(jobdata->apps); ++i_app) {
+        cur_app_context = (orte_app_context_t *)opal_pointer_array_get_item(jobdata->apps,
+                                                                            i_app);
+        if( NULL == cur_app_context ) {
+            continue;
+        }
+        if(proc->app_idx == cur_app_context->idx) {
+            break;
+        }
+    }
+
+    if( NULL == cur_app_context || proc->app_idx != cur_app_context->idx ) {
+        ORTE_ERROR_LOG(ORTE_ERROR);
+        exit_status = ORTE_ERROR;
+        goto cleanup;
+    }
+
+    /*
+     * if > 1 processes in this app context
+     *   Create a new app_context
+     *   Copy over attributes
+     *   Add it to the job_t data structure
+     *   Associate it with this process in the job
+     * else
+     *   Reuse this app_context
+     */
+    if( cur_app_context->num_procs > 1 ) {
+        /* Create a new app_context */
+        new_app_context = OBJ_NEW(orte_app_context_t);
+
+        /* Copy over attributes */
+        new_app_context->idx = cur_app_context->idx;
+        new_app_context->app = NULL; /* strdup(cur_app_context->app); */
+        new_app_context->num_procs = 1;
+        new_app_context->argv = NULL; /* opal_argv_copy(cur_app_context->argv); */
+        new_app_context->env = opal_argv_copy(cur_app_context->env);
+        new_app_context->cwd = (NULL == cur_app_context->cwd ? NULL :
+                                strdup(cur_app_context->cwd));
+        new_app_context->user_specified_cwd = cur_app_context->user_specified_cwd;
+        new_app_context->hostfile = (NULL == cur_app_context->hostfile ? NULL :
+                                     strdup(cur_app_context->hostfile));
+        new_app_context->add_hostfile = (NULL == cur_app_context->add_hostfile ? NULL :
+                                         strdup(cur_app_context->add_hostfile));
+        new_app_context->dash_host = opal_argv_copy(cur_app_context->dash_host);
+        new_app_context->prefix_dir = (NULL == cur_app_context->prefix_dir ? NULL :
+                                       strdup(cur_app_context->prefix_dir));
+        new_app_context->preload_binary = false;
+        new_app_context->preload_libs = false;
+        new_app_context->preload_files_dest_dir = NULL;
+        new_app_context->preload_files_src_dir = NULL;
+
+        asprintf(&tmp_str, reference_fmt_str, vpid_snapshot->process_name.vpid);
+        asprintf(&(new_app_context->sstore_load),
+                 "%s:%s:%s:%s:%s:%s",
+                 location_str,
+                 global_snapshot_ref,
+                 tmp_str,
+                 (vpid_snapshot->compress_comp == NULL ? "" : vpid_snapshot->compress_comp),
+                 (vpid_snapshot->compress_postfix == NULL ? "" : vpid_snapshot->compress_postfix),
+                 global_snapshot_seq);
+
+        new_app_context->used_on_node = cur_app_context->used_on_node;
+
+        /* Add it to the job_t data structure */
+        /*current_global_jobdata->num_apps++; */
+        new_app_context->idx = (jobdata->num_apps);
+        proc->app_idx = new_app_context->idx;
+
+        opal_pointer_array_add(jobdata->apps, new_app_context);
+        ++(jobdata->num_apps);
+
+        /* Remove association with the old app_context */
+        --(cur_app_context->num_procs);
+    }
+    else {
+        new_app_context = cur_app_context;
+
+        /* Clean out old stuff */
+        free(new_app_context->app);
+        new_app_context->app = NULL;
+
+        opal_argv_free(new_app_context->argv);
+        new_app_context->argv = NULL;
+
+        asprintf(&tmp_str, reference_fmt_str, vpid_snapshot->process_name.vpid);
+        asprintf(&(new_app_context->sstore_load),
+                 "%s:%s:%s:%s:%s:%s",
+                 location_str,
+                 global_snapshot_ref,
+                 tmp_str,
+                 (vpid_snapshot->compress_comp == NULL ? "" : vpid_snapshot->compress_comp),
+                 (vpid_snapshot->compress_postfix == NULL ? "" : vpid_snapshot->compress_postfix),
+                 global_snapshot_seq);
+    }
+
+    /*
+     * Update the app_context with the restart information
+     */
+    new_app_context->app = strdup("opal-restart");
+    opal_argv_append(&argc, &(new_app_context->argv), new_app_context->app);
+    opal_argv_append(&argc, &(new_app_context->argv), "-l");
+    opal_argv_append(&argc, &(new_app_context->argv), location_str);
+    opal_argv_append(&argc, &(new_app_context->argv), "-m");
+    opal_argv_append(&argc, &(new_app_context->argv), orte_sstore_base_local_metadata_filename);
+    opal_argv_append(&argc, &(new_app_context->argv), "-r");
+    if( NULL != tmp_str ) {
+        free(tmp_str);
+        tmp_str = NULL;
+    }
+    asprintf(&tmp_str, reference_fmt_str, vpid_snapshot->process_name.vpid);
+    opal_argv_append(&argc, &(new_app_context->argv), tmp_str);
+
+ cleanup:
+    if( NULL != tmp_str) {
+        free(tmp_str);
+        tmp_str = NULL;
+    }
+    if( NULL != location_str ) {
+        free(location_str);
+        location_str = NULL;
+    }
+    if( NULL != cache_location_str ) {
+        free(cache_location_str);
+        cache_location_str = NULL;
+    }
+    if( NULL != reference_fmt_str ) {
+        free(reference_fmt_str);
+        reference_fmt_str = NULL;
+    }
+    if( NULL != ref_location_fmt_str ) {
+        free(ref_location_fmt_str);
+        ref_location_fmt_str = NULL;
+    }
+
+    return exit_status;
+}
+#endif
+
+#if OPAL_ENABLE_FT_CR
+int orte_errmgr_base_restart_job(orte_jobid_t jobid, char * global_handle, int seq_num)
+{
+    int ret, exit_status = ORTE_SUCCESS;
+    orte_process_name_t loc_proc;
+    orte_sstore_base_handle_t prev_sstore_handle = ORTE_SSTORE_HANDLE_INVALID;
+
+    /* JJH First determine if we can recover this way */
+
+    /*
+     * Find the corresponding sstore handle
+     */
+    prev_sstore_handle = orte_sstore_handle_last_stable;
+    if( ORTE_SUCCESS != (ret = orte_sstore.request_restart_handle(&orte_sstore_handle_last_stable,
+                                                                  NULL,
+                                                                  global_handle,
+                                                                  seq_num,
+                                                                  NULL)) ) {
+        ORTE_ERROR_LOG(ret);
+        goto cleanup;
+    }
+
+    /*
+     * Start the recovery
+     */
+    orte_snapc_base_has_recovered = false;
+    loc_proc.jobid = jobid;
+    loc_proc.vpid = 0;
+    orte_errmgr_base_update_state(jobid, ORTE_JOB_STATE_RESTART,
+                                  &loc_proc, ORTE_PROC_STATE_KILLED_BY_CMD,
+                                  0, 0);
+    while( !orte_snapc_base_has_recovered ) {
+        opal_progress();
+    }
+    orte_sstore_handle_last_stable = prev_sstore_handle;
+
+ cleanup:
+    return exit_status;
+}
+
+int orte_errmgr_base_migrate_job(orte_jobid_t jobid, orte_snapc_base_request_op_t *datum)
+{
+    int ret, exit_status = ORTE_SUCCESS;
+    int i;
+    opal_list_t *proc_list = NULL;
+    opal_list_t *node_list = NULL;
+    opal_list_t *suggested_map_list = NULL;
+    orte_errmgr_predicted_map_t *onto_map = NULL;
+#if 0
+    orte_errmgr_predicted_proc_t *off_proc = NULL;
+    orte_errmgr_predicted_node_t *off_node = NULL;
+#endif
+
+    proc_list = OBJ_NEW(opal_list_t);
+    node_list = OBJ_NEW(opal_list_t);
+    suggested_map_list = OBJ_NEW(opal_list_t);
+
+    for( i = 0; i < datum->mig_num; ++i ) {
+        /*
+         * List all processes that are included in the migration.
+         * We will sort them out in the component.
+         */
+        onto_map = OBJ_NEW(orte_errmgr_predicted_map_t);
+
+        if( (datum->mig_off_node)[i] ) {
+            onto_map->off_current_node = true;
+        } else {
+            onto_map->off_current_node = false;
+        }
+
+        /* Who to migrate */
+        onto_map->proc_name.jobid = jobid;
+        onto_map->proc_name.vpid = (datum->mig_vpids)[i];
+
+        /* Destination */
+        onto_map->map_proc_name.jobid = jobid;
+        onto_map->map_proc_name.vpid = (datum->mig_vpid_pref)[i];
+
+        if( ((datum->mig_host_pref)[i])[0] == '\0') {
+            onto_map->map_node_name = NULL;
+        } else {
+            onto_map->map_node_name = strdup((datum->mig_host_pref)[i]);
+        }
+
+        opal_list_append(suggested_map_list, &(onto_map->super));
+    }
+
+    if( ORTE_SUCCESS != (ret = orte_errmgr_base_predicted_fault(proc_list, node_list, suggested_map_list)) ) {
+        ORTE_ERROR_LOG(ret);
+        exit_status = ret;
+        goto cleanup;
+    }
+
+ cleanup:
+    return exit_status;
+}
+
+#endif
+
+/********************
+ * Local Functions
+ ********************/
diff --git a/orte/mca/errmgr/base/errmgr_base_tool.c b/orte/mca/errmgr/base/errmgr_base_tool.c
new file mode 100644
index 0000000000..9842866902
--- /dev/null
+++ b/orte/mca/errmgr/base/errmgr_base_tool.c
@@ -0,0 +1,477 @@
+/*
+ * Copyright (c) 2009-2010 The Trustees of Indiana University.
+ *                         All rights reserved.
+ *
+ * $COPYRIGHT$
+ *
+ * Additional copyrights may follow
+ *
+ * $HEADER$
+ */
+
+#include "orte_config.h"
+
+#ifdef HAVE_STRING_H
+#include <string.h>
+#endif
+#if HAVE_SYS_TYPES_H
+#include <sys/types.h>
+#endif /* HAVE_SYS_TYPES_H */
+#ifdef HAVE_UNISTD_H
+#include <unistd.h>
+#endif /* HAVE_UNISTD_H */
+#if HAVE_SYS_TYPES_H
+#include <sys/types.h>
+#endif /* HAVE_SYS_TYPES_H */
+#if HAVE_SYS_STAT_H
+#include <sys/stat.h>
+#endif /* HAVE_SYS_STAT_H */
+#ifdef HAVE_DIRENT_H
+#include <dirent.h>
+#endif /* HAVE_DIRENT_H */
+#include
+
+#include "opal/mca/mca.h"
+#include "opal/mca/base/base.h"
+
+#include "opal/mca/base/mca_base_param.h"
+#include "opal/util/os_dirpath.h"
+#include "opal/util/output.h"
+#include "opal/util/basename.h"
+#include "opal/util/argv.h"
+#include "opal/mca/crs/crs.h"
+#include "opal/mca/crs/base/base.h"
+
+#include "orte/mca/rml/rml.h"
+#include "orte/mca/rml/rml_types.h"
+#include "orte/mca/snapc/snapc.h"
+#include "orte/runtime/orte_globals.h"
+#include "orte/util/name_fns.h"
+
+#include "orte/mca/errmgr/errmgr.h"
+#include "orte/mca/errmgr/base/base.h"
+#include "orte/mca/errmgr/base/errmgr_private.h"
+
+/**
+ * This file contains functions for the HNP to communicate with the
+ * orte-migrate command.
+ */
+#if OPAL_ENABLE_FT_CR
+
+/******************
+ * Local Functions
+ ******************/
+static int errmgr_base_tool_start_cmdline_listener(void);
+static int errmgr_base_tool_stop_cmdline_listener(void);
+
+static void errmgr_base_tool_cmdline_recv(int status,
+                                          orte_process_name_t* sender,
+                                          opal_buffer_t* buffer,
+                                          orte_rml_tag_t tag,
+                                          void* cbdata);
+static void errmgr_base_tool_cmdline_process_recv(int fd,
+                                                  short event,
+                                                  void *cbdata);
+
+
+/******************
+ * Object stuff
+ ******************/
+static orte_process_name_t errmgr_cmdline_sender = {ORTE_JOBID_INVALID, ORTE_VPID_INVALID};
+static bool errmgr_cmdline_recv_issued = false;
+static int errmgr_tool_initialized = false;
+
+/********************
+ * Module Functions
+ ********************/
+int orte_errmgr_base_tool_init(void)
+{
+    int ret;
+
+    if( (++errmgr_tool_initialized) != 1 ) {
+        if( errmgr_tool_initialized < 1 ) {
+            return OPAL_ERROR;
+        }
+        return OPAL_SUCCESS;
+    }
+
+    /* Only HNP communicates with tools */
+    if (! ORTE_PROC_IS_HNP) {
+        return ORTE_SUCCESS;
+    }
+
+    /*
+     * Set up command line migrate tool request listener
+     */
+    if( ORTE_SUCCESS != (ret = errmgr_base_tool_start_cmdline_listener()) ) {
+        ORTE_ERROR_LOG(ret);
+        return ret;
+    }
+
+    return ORTE_SUCCESS;
+}
+
+int orte_errmgr_base_tool_finalize(void)
+{
+    int ret;
+
+    if( (--errmgr_tool_initialized) != 0 ) {
+        if( errmgr_tool_initialized < 0 ) {
+            return OPAL_ERROR;
+        }
+        return OPAL_SUCCESS;
+    }
+
+    /* Only HNP communicates with tools */
+    if (! ORTE_PROC_IS_HNP) {
+        return ORTE_SUCCESS;
+    }
+
+    /*
+     * Clean up listeners
+     */
+    if( ORTE_SUCCESS != (ret = errmgr_base_tool_stop_cmdline_listener()) ) {
+        ORTE_ERROR_LOG(ret);
+        return ret;
+    }
+
+    return ORTE_SUCCESS;
+}
+
+int orte_errmgr_base_migrate_update(int status)
+{
+    int ret, exit_status = ORTE_SUCCESS;
+    opal_buffer_t *loc_buffer = NULL;
+    orte_errmgr_tool_cmd_flag_t command = ORTE_ERRMGR_MIGRATE_TOOL_UPDATE_CMD;
+
+    /* Only HNP communicates with tools */
+    if (! ORTE_PROC_IS_HNP) {
+        return ORTE_SUCCESS;
+    }
+
+    /*
+     * If this is an invalid state, then return an error
+     */
+    if( ORTE_ERRMGR_MIGRATE_MAX < status ) {
+        opal_output(orte_errmgr_base.output,
+                    "errmgr:base:tool:update() Error: Invalid state %d > (Max %d)",
+                    status, ORTE_ERRMGR_MIGRATE_MAX);
+        return ORTE_ERR_BAD_PARAM;
+    }
+
+    /*
+     * If the caller is indicating that they are finished and ready for another
+     * command, then repost the RML listener.
+     */
+    if( ORTE_ERRMGR_MIGRATE_STATE_NONE == status ) {
+        if( ORTE_SUCCESS != (ret = errmgr_base_tool_start_cmdline_listener()) ) {
+            ORTE_ERROR_LOG(ret);
+            return ret;
+        }
+        return ORTE_SUCCESS;
+    }
+
+    /*
+     * Noop if invalid peer, or peer not specified
+     */
+    if( OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_NAME_INVALID, &errmgr_cmdline_sender) ) {
+        return ORTE_SUCCESS;
+    }
+
+    /*
+     * Do not send to self, as that is silly.
+     */
+    if( OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_PROC_MY_HNP, &errmgr_cmdline_sender) ) {
+        OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
+                             "errmgr:base:tool:update() Warning: Do not send to self!\n"));
+        return ORTE_SUCCESS;
+    }
+
+    OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
+                         "errmgr:base:tool:update() Sending update command (status %d)\n",
+                         status));
+
+    /********************
+     * Send over the status of the migration
+     *  - migration state
+     ********************/
+    if (NULL == (loc_buffer = OBJ_NEW(opal_buffer_t))) {
+        exit_status = ORTE_ERROR;
+        goto cleanup;
+    }
+
+    if (ORTE_SUCCESS != (ret = opal_dss.pack(loc_buffer, &command, 1, ORTE_ERRMGR_MIGRATE_TOOL_CMD)) ) {
+        opal_output(orte_errmgr_base.output,
+                    "errmgr:base:tool:update() Error: DSS Pack (cmd) Failure (ret = %d)\n",
+                    ret);
+        ORTE_ERROR_LOG(ret);
+        exit_status = ret;
+        goto cleanup;
+    }
+
+    if (ORTE_SUCCESS != (ret = opal_dss.pack(loc_buffer, &status, 1, OPAL_INT))) {
+        opal_output(orte_errmgr_base.output,
+                    "errmgr:base:tool:update() Error: DSS Pack (status) Failure (ret = %d)\n",
+                    ret);
+        ORTE_ERROR_LOG(ret);
+        exit_status = ret;
+        goto cleanup;
+    }
+
+    if (0 > (ret = orte_rml.send_buffer(&errmgr_cmdline_sender, loc_buffer, ORTE_RML_TAG_MIGRATE, 0))) {
+        opal_output(orte_errmgr_base.output,
+                    "errmgr:base:tool:update() Error: Send (status) Failure (ret = %d)\n",
+                    ret);
+        ORTE_ERROR_LOG(ret);
+        exit_status = ret;
+        goto cleanup;
+    }
+
+ cleanup:
+    if(NULL != loc_buffer) {
+        OBJ_RELEASE(loc_buffer);
+        loc_buffer = NULL;
+    }
+
+    return exit_status;
+}
+
+/********************
+ * Utility functions
+ ********************/
+
+/********************
+ * Local Functions
+ ********************/
+static int errmgr_base_tool_start_cmdline_listener(void)
+{
+    int ret, exit_status = ORTE_SUCCESS;
+
+    if (errmgr_cmdline_recv_issued && ORTE_PROC_IS_HNP) {
+        return ORTE_SUCCESS;
+    }
+
+    OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output,
+                         "errmgr:base:tool: Startup Command Line Channel"));
+
+    /*
+     * Coordinator command listener
+     */
+    errmgr_cmdline_sender.jobid = ORTE_JOBID_INVALID;
+    errmgr_cmdline_sender.vpid = ORTE_VPID_INVALID;
+    if (ORTE_SUCCESS != (ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
+                                                       ORTE_RML_TAG_MIGRATE,
+                                                       0,
+                                                       errmgr_base_tool_cmdline_recv,
+                                                       NULL))) {
+        ORTE_ERROR_LOG(ret);
+        exit_status = ret;
+        goto cleanup;
+    }
+
+    errmgr_cmdline_recv_issued = true;
+
+ cleanup:
+    return exit_status;
+}
+
+
+static int errmgr_base_tool_stop_cmdline_listener(void)
+{
+    int ret, exit_status = ORTE_SUCCESS;
+
+    if (!errmgr_cmdline_recv_issued && ORTE_PROC_IS_HNP) {
+        return ORTE_SUCCESS;
+    }
+
+    OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output,
+                         "errmgr:base:tool: Shutdown Command Line Channel"));
+
+    if (ORTE_SUCCESS != (ret = orte_rml.recv_cancel(ORTE_NAME_WILDCARD,
+                                                    ORTE_RML_TAG_MIGRATE))) {
+        ORTE_ERROR_LOG(ret);
+        exit_status = ret;
+        goto cleanup;
+    }
+
+    errmgr_cmdline_recv_issued = false;
+
+ cleanup:
+    return exit_status;
+}
+
+/*****************
+ * Listener Callbacks
+ *****************/
+static void errmgr_base_tool_cmdline_recv(int status,
+                                          orte_process_name_t* sender,
+                                          opal_buffer_t* buffer,
+                                          orte_rml_tag_t tag,
+                                          void* cbdata)
+{
+    if( ORTE_RML_TAG_MIGRATE != tag ) {
+        opal_output(orte_errmgr_base.output,
+                    "errmgr:base:tool:recv() Error: Unknown tag: Received a command message from %s (tag = %d).",
+                    ORTE_NAME_PRINT(sender), tag);
+        ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
+        return;
+    }
+
+    OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
+                         "errmgr:base:tool:recv() Command Line: Start a migration operation [Sender = %s]",
+                         ORTE_NAME_PRINT(sender)));
+
+    errmgr_cmdline_recv_issued = false; /* Not a persistent RML message */
+
+    /*
+     * Do not process this right away - we need to get out of the recv before
+     * we process the message to avoid performing the rest of the job while
+     * inside this receive! Instead, set up an event so that the message gets processed
+     * as soon as we leave the recv.
+     *
+     * The macro makes a copy of the buffer, which we release above - the incoming
+     * buffer, however, is NOT released here, although its payload IS transferred
+     * to the message buffer for later processing
+     *
+     */
+    ORTE_MESSAGE_EVENT(sender, buffer, tag, errmgr_base_tool_cmdline_process_recv);
+
+    return;
+}
+
+static void errmgr_base_tool_cmdline_process_recv(int fd, short event, void *cbdata)
+{
+    int ret;
+    orte_message_event_t *mev = (orte_message_event_t*)cbdata;
+    orte_process_name_t *sender = NULL, swap_dest;
+    orte_errmgr_tool_cmd_flag_t command;
+    orte_std_cntr_t count = 1;
+    char *off_nodes = NULL;
+    char *off_procs = NULL;
+    char *onto_nodes = NULL;
+    char **split_off_nodes = NULL;
+    char **split_off_procs = NULL;
+    char **split_onto_nodes = NULL;
+    opal_list_t *proc_list = NULL;
+    opal_list_t *node_list = NULL;
+    opal_list_t *suggested_map_list = NULL;
+    orte_errmgr_predicted_proc_t *off_proc = NULL;
+    orte_errmgr_predicted_node_t *off_node = NULL;
+    orte_errmgr_predicted_map_t *onto_map = NULL;
+    int cnt = 0, i;
+
+    sender = &(mev->sender);
+
+    /*
+     * If we are already interacting with a command line tool then reject this
+     * request, since we only allow the processing of one tool command at a
+     * time.
+     */
+    if( OPAL_EQUAL != orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_NAME_INVALID, &errmgr_cmdline_sender) ) {
+        swap_dest.jobid = errmgr_cmdline_sender.jobid;
+        swap_dest.vpid = errmgr_cmdline_sender.vpid;
+
+        errmgr_cmdline_sender = *sender;
+        orte_errmgr_base_migrate_update(ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS);
+
+        errmgr_cmdline_sender.jobid = swap_dest.jobid;
+        errmgr_cmdline_sender.vpid = swap_dest.vpid;
+
+        goto cleanup;
+    }
+
+    errmgr_cmdline_sender = *sender;
+
+    count = 1;
+    if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &command, &count, ORTE_ERRMGR_MIGRATE_TOOL_CMD))) {
+        ORTE_ERROR_LOG(ret);
+        goto cleanup;
+    }
+
+    /*
+     * orte-migrate has requested a process migration
+     */
+    if (ORTE_ERRMGR_MIGRATE_TOOL_INIT_CMD == command) {
+        OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
+                             "errmgr:base:tool:recv() Command line requested process migration [command %d]\n",
+                             command));
+
+        /*
+         * Unpack the buffer from the orte-migrate command
+         */
+        count = 1;
+        if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &(off_procs), &count, OPAL_STRING))) {
+            ORTE_ERROR_LOG(ret);
+            goto cleanup;
+        }
+
+        if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &(off_nodes), &count, OPAL_STRING))) {
+            ORTE_ERROR_LOG(ret);
+            goto cleanup;
+        }
+
+        if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &(onto_nodes), &count, OPAL_STRING))) {
+            ORTE_ERROR_LOG(ret);
+            goto cleanup;
+        }
+
+        /*
+         * Parse the comma-separated lists
+         */
+        proc_list = OBJ_NEW(opal_list_t);
+        node_list = OBJ_NEW(opal_list_t);
+        suggested_map_list = OBJ_NEW(opal_list_t);
+
+        split_off_procs = opal_argv_split(off_procs, ',');
+        cnt = opal_argv_count(split_off_procs);
+        if( cnt > 0 ) {
+            for(i = 0; i < cnt; ++i) {
+                off_proc = OBJ_NEW(orte_errmgr_predicted_proc_t);
+                off_proc->proc_name.vpid = atoi(split_off_procs[i]);
+                opal_list_append(proc_list, &(off_proc->super));
+            }
+        }
+
+        split_off_nodes = opal_argv_split(off_nodes, ',');
+        cnt = opal_argv_count(split_off_nodes);
+        if( cnt > 0 ) {
+            for(i = 0; i < cnt; ++i) {
+                off_node = OBJ_NEW(orte_errmgr_predicted_node_t);
+                off_node->node_name = strdup(split_off_nodes[i]);
+                opal_list_append(node_list, &(off_node->super));
+            }
+        }
+
+        split_onto_nodes = opal_argv_split(onto_nodes, ',');
+        cnt = opal_argv_count(split_onto_nodes);
+        if( cnt > 0 ) {
+            for(i = 0; i < cnt; ++i) {
+                onto_map = OBJ_NEW(orte_errmgr_predicted_map_t);
+                onto_map->map_node_name = strdup(split_onto_nodes[i]);
+                opal_list_append(suggested_map_list, &(onto_map->super));
+            }
+        }
+
+        /*
+         * Pass to the predicted fault function to see how they would like to progress
+         */
+        orte_errmgr_base_predicted_fault(proc_list, node_list, suggested_map_list);
+    }
+    /*
+     * Unknown command
+     */
+    else {
+        OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
+                             "errmgr:base:tool:recv() Command line sent an unknown command (command %d)\n",
+                             command));
+        ORTE_ERROR_LOG(ORTE_ERR_NOT_SUPPORTED);
+        goto cleanup;
+    }
+
+ cleanup:
+    /* release the message event */
+    OBJ_RELEASE(mev);
+
+    return;
+}
+#endif
diff --git a/orte/mca/errmgr/base/errmgr_private.h b/orte/mca/errmgr/base/errmgr_private.h
index 67fd37bb40..7da3b3a861 100644
--- a/orte/mca/errmgr/base/errmgr_private.h
+++ b/orte/mca/errmgr/base/errmgr_private.h
@@ -51,7 +51,7 @@ ORTE_DECLSPEC extern orte_errmgr_base_t orte_errmgr_base;
 /* Define the ERRMGR command flag */
 typedef uint8_t orte_errmgr_cmd_flag_t;
 #define ORTE_ERRMGR_CMD OPAL_UINT8
-    
+
 /* define some commands */
 #define ORTE_ERRMGR_ABORT_PROCS_REQUEST_CMD 0x01
 #define ORTE_ERRMGR_REGISTER_CALLBACK_CMD 0x02
@@ -72,9 +72,10 @@ ORTE_DECLSPEC int orte_errmgr_base_abort(int error_code, char *fmt, ...)
     __opal_attribute_format__(__printf__, 2, 3)
 # endif
 ;
-ORTE_DECLSPEC int orte_errmgr_base_predicted_fault(char ***proc_list,
-                                                   char ***node_list,
-                                                   char ***suggested_nodes);
+
+ORTE_DECLSPEC int orte_errmgr_base_predicted_fault(opal_list_t *proc_list,
+                                                   opal_list_t *node_list,
+                                                   opal_list_t *suggested_map);
 ORTE_DECLSPEC int orte_errmgr_base_suggest_map_targets(orte_proc_t *proc,
                                                        orte_node_t *oldnode,
                                                        opal_list_t *node_list);
diff --git a/orte/mca/errmgr/crmig/Makefile.am b/orte/mca/errmgr/crmig/Makefile.am
new file mode 100644
index 0000000000..1fbd3ed2a5
--- /dev/null
+++ b/orte/mca/errmgr/crmig/Makefile.am
@@ -0,0 +1,38 @@
+#
+# Copyright (c) 2009-2010 The Trustees of Indiana University.
+#                         All rights reserved.
+#
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+
+dist_pkgdata_DATA = help-orte-errmgr-crmig.txt
+
+sources = \
+        errmgr_crmig.h \
+        errmgr_crmig_component.c \
+        errmgr_crmig_module.c
+
+# Make the output library in this directory, and name it either
+# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
+# (for static builds).
+
+if OMPI_BUILD_errmgr_crmig_DSO
+component_noinst =
+component_install = mca_errmgr_crmig.la
+else
+component_noinst = libmca_errmgr_crmig.la
+component_install =
+endif
+
+mcacomponentdir = $(pkglibdir)
+mcacomponent_LTLIBRARIES = $(component_install)
+mca_errmgr_crmig_la_SOURCES = $(sources)
+mca_errmgr_crmig_la_LDFLAGS = -module -avoid-version
+
+noinst_LTLIBRARIES = $(component_noinst)
+libmca_errmgr_crmig_la_SOURCES = $(sources)
+libmca_errmgr_crmig_la_LDFLAGS = -module -avoid-version
diff --git a/orte/mca/errmgr/crmig/configure.m4 b/orte/mca/errmgr/crmig/configure.m4
new file mode 100644
index 0000000000..9c30fe1923
--- /dev/null
+++ b/orte/mca/errmgr/crmig/configure.m4
@@ -0,0 +1,20 @@
+# -*- shell-script -*-
+#
+# Copyright (c) 2009-2010 The Trustees of Indiana University.
+#                         All rights reserved.
+#
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+
+# MCA_errmgr_crmig_CONFIG([action-if-found], [action-if-not-found])
+# -----------------------------------------------------------
+AC_DEFUN([MCA_errmgr_crmig_CONFIG],[
+    # If we don't want FT, don't compile this component
+    AS_IF([test "$opal_want_ft_cr" = "1"],
+        [$1],
+        [$2])
+])dnl
diff --git a/orte/mca/errmgr/crmig/configure.params b/orte/mca/errmgr/crmig/configure.params
new file mode 100644
index 0000000000..458f258c39
--- /dev/null
+++ b/orte/mca/errmgr/crmig/configure.params
@@ -0,0 +1,14 @@
+# -*- shell-script -*-
+#
+# Copyright (c) 2009-2010 The Trustees of Indiana University.
+#                         All rights reserved.
+#
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+
+PARAM_INIT_FILE=errmgr_crmig_component.c
+PARAM_CONFIG_FILES="Makefile"
diff --git a/orte/mca/errmgr/crmig/errmgr_crmig.h b/orte/mca/errmgr/crmig/errmgr_crmig.h
new file mode 100644
index 0000000000..77914b7b8b
--- /dev/null
+++ b/orte/mca/errmgr/crmig/errmgr_crmig.h
@@ -0,0 +1,93 @@
+/*
+ * Copyright (c) 2009-2010 The Trustees of Indiana University.
+ *                         All rights reserved.
+ *
+ * $COPYRIGHT$
+ *
+ * Additional copyrights may follow
+ *
+ * $HEADER$
+ */
+
+/**
+ * @file
+ *
+ * Checkpoint/Restart Process Migration (CRMIG) ErrMgr component
+ *
+ * Simple, braindead implementation.
+ */ + +#ifndef MCA_ERRMGR_CRMIG_EXPORT_H +#define MCA_ERRMGR_CRMIG_EXPORT_H + +#include "orte_config.h" + +#include "opal/mca/mca.h" +#include "opal/event/event.h" + +#include "orte/mca/filem/filem.h" +#include "orte/mca/errmgr/errmgr.h" + +BEGIN_C_DECLS + + /* + * Local Component structures + */ + struct orte_errmgr_crmig_component_t { + orte_errmgr_base_component_t super; /** Base Errmgr component */ + bool crmig_enabled; + bool timing_enabled; + }; + typedef struct orte_errmgr_crmig_component_t orte_errmgr_crmig_component_t; + OPAL_MODULE_DECLSPEC extern orte_errmgr_crmig_component_t mca_errmgr_crmig_component; + + int orte_errmgr_crmig_component_query(mca_base_module_t **module, int *priority); + + /* + * Module functions: Global + */ + int orte_errmgr_crmig_global_module_init(void); + int orte_errmgr_crmig_global_module_finalize(void); + + int orte_errmgr_crmig_global_update_state(orte_jobid_t job, + orte_job_state_t jobstate, + orte_process_name_t *proc_name, + orte_proc_state_t state, + pid_t pid, + orte_exit_code_t exit_code, + orte_errmgr_stack_state_t *stack_state); + + int orte_errmgr_crmig_global_predicted_fault(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map, + orte_errmgr_stack_state_t *stack_state); + int orte_errmgr_crmig_global_process_fault(orte_job_t *jdata, + orte_process_name_t *proc_name, + orte_proc_state_t state, + orte_errmgr_stack_state_t *stack_state); + int orte_errmgr_crmig_global_suggest_map_targets(orte_proc_t *proc, + orte_node_t *oldnode, + opal_list_t *node_list, + orte_errmgr_stack_state_t *stack_state); + + int orte_errmgr_crmig_global_ft_event(int state); + + /* + * Module functions: Local + */ + int orte_errmgr_crmig_local_module_init(void); + int orte_errmgr_crmig_local_module_finalize(void); + + int orte_errmgr_crmig_local_update_state(orte_jobid_t job, + orte_job_state_t jobstate, + orte_process_name_t *proc_name, + orte_proc_state_t state, + pid_t pid, + orte_exit_code_t exit_code, + 
orte_errmgr_stack_state_t *stack_state); + int orte_errmgr_crmig_local_ft_event(int state); + + +END_C_DECLS + +#endif /* MCA_ERRMGR_CRMIG_EXPORT_H */ diff --git a/orte/mca/errmgr/crmig/errmgr_crmig_component.c b/orte/mca/errmgr/crmig/errmgr_crmig_component.c new file mode 100644 index 0000000000..85cb378bf5 --- /dev/null +++ b/orte/mca/errmgr/crmig/errmgr_crmig_component.c @@ -0,0 +1,142 @@ +/* + * Copyright (c) 2009-2010 The Trustees of Indiana University. + * All rights reserved. + * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" +#include "opal/util/output.h" + +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/errmgr/base/base.h" +#include "orte/mca/errmgr/base/errmgr_private.h" +#include "errmgr_crmig.h" + +/* + * Public string for version number + */ +const char *orte_errmgr_crmig_component_version_string = + "ORTE ERRMGR crmig MCA component version " ORTE_VERSION; + +/* + * Local functionality + */ +static int errmgr_crmig_open(void); +static int errmgr_crmig_close(void); + +/* + * Instantiate the public struct with all of our public information + * and pointers to our public functions in it + */ +orte_errmgr_crmig_component_t mca_errmgr_crmig_component = { + /* First do the base component stuff */ + { + /* Handle the general mca_component_t struct containing + * meta information about the component itself + */ + { + ORTE_ERRMGR_BASE_VERSION_3_0_0, + /* Component name and version */ + "crmig", + ORTE_MAJOR_VERSION, + ORTE_MINOR_VERSION, + ORTE_RELEASE_VERSION, + + /* Component open and close functions */ + errmgr_crmig_open, + errmgr_crmig_close, + orte_errmgr_crmig_component_query + }, + { + /* The component is checkpoint ready */ + MCA_BASE_METADATA_PARAM_CHECKPOINT + }, + + /* Verbosity level */ + 0, + /* opal_output handler */ + -1, + /* Default priority */ + 40 + } +}; + +static int errmgr_crmig_open(void) +{ + int val; + + /* + * Register the MCA parameters for this component + */ + mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version, + "priority", + "Priority of the ERRMGR crmig component", + false, false, + mca_errmgr_crmig_component.super.priority, + &mca_errmgr_crmig_component.super.priority); + + mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version, + "verbose", + "Verbose level for the ERRMGR crmig component", + false, false, + mca_errmgr_crmig_component.super.verbose, + &mca_errmgr_crmig_component.super.verbose); + /* If there is a custom verbose level for this component, then use it; + * otherwise take our parent's level and output channel + */ + if ( 0 != mca_errmgr_crmig_component.super.verbose) { + mca_errmgr_crmig_component.super.output_handle = opal_output_open(NULL); + opal_output_set_verbosity(mca_errmgr_crmig_component.super.output_handle, + mca_errmgr_crmig_component.super.verbose); + } else { + mca_errmgr_crmig_component.super.output_handle = orte_errmgr_base.output; + } + + mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version, + "timing", + "Enable Process Migration timer", + false, false, + 0, &val); + mca_errmgr_crmig_component.timing_enabled = OPAL_INT_TO_BOOL(val); + + mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version, + "enable", + "Enable Process Migration (Default: 0/off)", + false, false, + 0, &val); + mca_errmgr_crmig_component.crmig_enabled = OPAL_INT_TO_BOOL(val); + + /* + * Debug Output + */ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: open()"); + opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: open: priority = %d", + mca_errmgr_crmig_component.super.priority); + opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: open: verbosity = %d", + mca_errmgr_crmig_component.super.verbose); + opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: open: Proc. Mig.
= %s", + (mca_errmgr_crmig_component.crmig_enabled ? "Enabled" : "Disabled")); + opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: open: timing = %s", + (mca_errmgr_crmig_component.timing_enabled ? "Enabled" : "Disabled")); + + return ORTE_SUCCESS; +} + +static int errmgr_crmig_close(void) +{ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: close()"); + + return ORTE_SUCCESS; +} diff --git a/orte/mca/errmgr/crmig/errmgr_crmig_module.c b/orte/mca/errmgr/crmig/errmgr_crmig_module.c new file mode 100644 index 0000000000..8b13e61deb --- /dev/null +++ b/orte/mca/errmgr/crmig/errmgr_crmig_module.c @@ -0,0 +1,1678 @@ +/* + * Copyright (c) 2009-2010 The Trustees of Indiana University. + * All rights reserved. + * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" + +#include +#ifdef HAVE_UNISTD_H +#include +#endif /* HAVE_UNISTD_H */ +#ifdef HAVE_STRING_H +#include +#endif + +#include "opal/util/show_help.h" +#include "opal/util/output.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" +#include "opal/util/argv.h" +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" +#include "opal/mca/crs/crs.h" +#include "opal/mca/crs/base/base.h" + +#include "orte/util/error_strings.h" +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "opal/dss/dss.h" +#include "orte/mca/rml/rml.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/iof/iof.h" +#include "orte/mca/plm/plm.h" +#include "orte/mca/plm/base/base.h" +#include "orte/mca/plm/base/plm_private.h" +#include "orte/mca/filem/filem.h" +#include "orte/mca/grpcomm/grpcomm.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/rmaps/rmaps_types.h" +#include "orte/mca/routed/routed.h" +#include "orte/mca/snapc/snapc.h" +#include "orte/mca/snapc/base/base.h" 
+ +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/errmgr/base/base.h" +#include "orte/mca/errmgr/base/errmgr_private.h" + +#include "errmgr_crmig.h" + +#include MCA_timer_IMPLEMENTATION_HEADER + +/****************** + * Crmig module + ******************/ +static orte_errmgr_base_module_t global_module = { + /** Initialization Function */ + orte_errmgr_crmig_global_module_init, + /** Finalization Function */ + orte_errmgr_crmig_global_module_finalize, + /** Update State */ + orte_errmgr_crmig_global_update_state, + orte_errmgr_crmig_global_predicted_fault, + /*orte_errmgr_crmig_global_process_fault,*/ + orte_errmgr_crmig_global_suggest_map_targets, + orte_errmgr_crmig_global_ft_event +}; + +static orte_errmgr_base_module_t local_module = { + /** Initialization Function */ + orte_errmgr_crmig_local_module_init, + /** Finalization Function */ + orte_errmgr_crmig_local_module_finalize, + /** Update State */ + orte_errmgr_crmig_local_update_state, + NULL, + NULL, + orte_errmgr_crmig_local_ft_event +}; + +/************************************ + * Locally Global vars & functions :) + ************************************/ +static orte_jobid_t current_global_jobid = ORTE_JOBID_INVALID; +static orte_job_t *current_global_jobdata = NULL; + +static bool migrating_underway = false; +static bool migrating_terminated = false; +static bool migrating_restarted = false; + +static opal_list_t *current_onto_mapping_general = NULL; +static opal_list_t *current_onto_mapping_exclusive = NULL; + +/*** Command Line Interactions */ +static int current_migration_status = ORTE_ERRMGR_MIGRATE_STATE_NONE; + +static int errmgr_crmig_global_migrate(opal_list_t *off_procs, opal_list_t *off_nodes, opal_list_t *onto_map); + +static void errmgr_crmig_process_fault_app(orte_job_t *jdata, + orte_process_name_t *proc, + orte_proc_state_t state, + orte_errmgr_stack_state_t *stack_state); +static void errmgr_crmig_process_fault_daemon(orte_job_t *jdata, + orte_process_name_t *proc, + 
orte_proc_state_t state, + orte_errmgr_stack_state_t *stack_state); + +static bool check_if_duplicate_proc(orte_proc_t *proc, opal_pointer_array_t *migrating_procs); +static int check_if_terminated(opal_pointer_array_t *migrating_procs); +static int check_if_restarted(opal_pointer_array_t *migrating_procs); + +static int check_and_pre_map(opal_list_t *off_procs, + opal_list_t *off_nodes, + orte_snapc_base_quiesce_t *cur_datum); + +static void display_request(opal_list_t *off_procs, + opal_list_t *off_nodes, + orte_snapc_base_quiesce_t *cur_datum); + +static void update_proc(orte_job_t *jdata, + orte_process_name_t *proc, + orte_proc_state_t state, + orte_exit_code_t exit_code); + +/* + * Timer stuff + */ +static void errmgr_crmig_set_time(int idx); +static void errmgr_crmig_display_all_timers(void); +static void errmgr_crmig_clear_timers(void); + +static double errmgr_crmig_get_time(void); +static void errmgr_crmig_display_indv_timer_core(double diff, char *str); +static double timer_start[OPAL_CR_TIMER_MAX]; + +#define ERRMGR_CRMIG_TIMER_START 0 +#define ERRMGR_CRMIG_TIMER_SETUP 1 +#define ERRMGR_CRMIG_TIMER_CKPT 2 +#define ERRMGR_CRMIG_TIMER_TERM 3 +#define ERRMGR_CRMIG_TIMER_RESETUP 4 +#define ERRMGR_CRMIG_TIMER_RESTART 5 +#define ERRMGR_CRMIG_TIMER_FINISH 6 +#define ERRMGR_CRMIG_TIMER_MAX 7 + +#define ERRMGR_CRMIG_CLEAR_TIMERS() \ + { \ + if(OPAL_UNLIKELY(mca_errmgr_crmig_component.timing_enabled > 0)) { \ + errmgr_crmig_clear_timers(); \ + } \ + } + +#define ERRMGR_CRMIG_SET_TIMER(idx) \ + { \ + if(OPAL_UNLIKELY(mca_errmgr_crmig_component.timing_enabled > 0)) { \ + errmgr_crmig_set_time(idx); \ + } \ + } + +#define ERRMGR_CRMIG_DISPLAY_ALL_TIMERS() \ + { \ + if(OPAL_UNLIKELY(mca_errmgr_crmig_component.timing_enabled > 0)) { \ + errmgr_crmig_display_all_timers(); \ + } \ + } + +/************************ + * Function Definitions + ************************/ +/* + * MCA Functions + */ +int orte_errmgr_crmig_component_query(mca_base_module_t **module, int 
*priority) +{ + if( !(orte_enable_recovery) ) { + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: component_query() - Disabled: Recovery is not enabled"); + *priority = -1; + *module = NULL; + return ORTE_SUCCESS; + } + + if( !mca_errmgr_crmig_component.crmig_enabled ) { + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: component_query() - Disabled: Process Migration " + "is not enabled via errmgr_crmig_enable MCA parameter."); + *priority = -1; + *module = NULL; + return ORTE_SUCCESS; + } + + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: component_query()"); + + *priority = mca_errmgr_crmig_component.super.priority; + if( ORTE_PROC_IS_HNP ) { + *module = (mca_base_module_t *)&global_module; + } + else if (ORTE_PROC_IS_DAEMON) { + *module = (mca_base_module_t *)&local_module; + } + else { + *module = NULL; + } + + return ORTE_SUCCESS; +} + +/************************ + * Function Definitions: Global + ************************/ +int orte_errmgr_crmig_global_module_init(void) +{ + int ret; + + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: init()"); + + migrating_underway = false; + + current_global_jobid = ORTE_JOBID_INVALID; + current_global_jobdata = NULL; + + /* + * Initialize the connection to the orte-migrate tool + */ + if( ORTE_SUCCESS != (ret = orte_errmgr_base_tool_init()) ) { + ORTE_ERROR_LOG(ret); + return ret; + } + + ERRMGR_CRMIG_CLEAR_TIMERS(); + + return ORTE_SUCCESS; +} + +int orte_errmgr_crmig_global_module_finalize(void) +{ + int ret; + + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: finalize()"); + + /* + * Finalize the connection to the orte-migrate tool + */ + if( ORTE_SUCCESS != (ret = orte_errmgr_base_tool_finalize()) ) { + ORTE_ERROR_LOG(ret); + return ret; + } + + migrating_underway = false; + + current_global_jobid = 
ORTE_JOBID_INVALID; + current_global_jobdata = NULL; + + ERRMGR_CRMIG_CLEAR_TIMERS(); + + return ORTE_SUCCESS; +} + +int orte_errmgr_crmig_global_predicted_fault(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map, + orte_errmgr_stack_state_t *stack_state) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_job_t *jdata = NULL; + int i; + + /* + * JJH: RETURN HERE + * If we are already migrating, then reject this request + */ + if( migrating_underway ) { + ; + } + + /* + * Determine the jobid for this migration + * JJH: Assumes only one job active at any one time + */ + for(i = 0; i < orte_job_data->size; ++i ) { + if (NULL == (jdata = (orte_job_t*)opal_pointer_array_get_item(orte_job_data, i))) { + continue; + } + /* Exclude ourselves */ + if( jdata->jobid == ORTE_PROC_MY_NAME->jobid ) { + continue; + } + current_global_jobdata = jdata; + current_global_jobid = jdata->jobid; + break; + } + if( NULL == current_global_jobdata ) { + opal_output(0, "errmgr:crmig:predicted_fault() (Global) Error: Cannot find the jdata for the current job."); + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + current_global_jobdata->controls |= ORTE_JOB_CONTROL_RECOVERABLE; + + current_migration_status = ORTE_ERRMGR_MIGRATE_STATE_REQUEST; + if( ORTE_SUCCESS != (ret = orte_errmgr_base_migrate_update(current_migration_status)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /************************* + * Kick off the migration + *************************/ + if( ORTE_SUCCESS != (ret = errmgr_crmig_global_migrate(proc_list, node_list, suggested_map)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /************************ + * Set up the Command Line listener again + *************************/ + current_migration_status = ORTE_ERRMGR_MIGRATE_STATE_NONE; + if( ORTE_SUCCESS != (ret = orte_errmgr_base_migrate_update(current_migration_status)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + +
opal_show_help("help-orte-errmgr-crmig.txt", "migrated_job", true); + + cleanup: + return exit_status; +} + +int orte_errmgr_crmig_global_update_state(orte_jobid_t job, + orte_job_state_t jobstate, + orte_process_name_t *proc_name, + orte_proc_state_t state, + pid_t pid, + orte_exit_code_t exit_code, + orte_errmgr_stack_state_t *stack_state) +{ + orte_job_t *jdata = NULL; + int ret = ORTE_SUCCESS; + + /* + * If ORTE is trying to shut down, just let it + */ + if (orte_finalizing) { + return ORTE_SUCCESS; + } + + OPAL_OUTPUT_VERBOSE((1, orte_errmgr_base.output, + "%s errmgr:crmig: job %s reported state %s" + " for proc %s state %s exit_code %d", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + ORTE_JOBID_PRINT(job), + orte_job_state_to_str(jobstate), + (NULL == proc_name) ? "NULL" : ORTE_NAME_PRINT(proc_name), + orte_proc_state_to_str(state), exit_code)); + + /* get the job data object for this process */ + if (NULL == (jdata = orte_get_job_data_object(job))) { + ret = ORTE_ERROR; + ORTE_ERROR_LOG(ret); + return ret; + } + + if( ORTE_PROC_STATE_ABORTED_BY_SIG == state || + ORTE_PROC_STATE_COMM_FAILED == state ) { + if( ORTE_SUCCESS != (ret = orte_errmgr_crmig_global_process_fault(jdata, proc_name, state, stack_state)) ) { + ORTE_ERROR_LOG(ret); + return ret; + } + } + else if( ORTE_PROC_STATE_KILLED_BY_CMD == state ) { + if( migrating_underway ) { + /* If we are migrating, then we need to mask this to prevent the lower level from terminating us. + * Clear the JOB_ABORT bit (an XOR would set it if it were not already set). */ + update_proc(jdata, proc_name, state, exit_code); + *stack_state &= ~ORTE_ERRMGR_STACK_STATE_JOB_ABORT; + *stack_state |= ORTE_ERRMGR_STACK_STATE_RECOVERED; + } + } + + return ORTE_SUCCESS; +} + +int orte_errmgr_crmig_global_process_fault(orte_job_t *jdata, + orte_process_name_t *proc_name, + orte_proc_state_t state, + orte_errmgr_stack_state_t *stack_state) +{ + /* + * JJH: Todo + * The expected logic here is: + * if( a daemon with children fails ) { + * abort migration.
+ * } + * if( a daemon without children fails ) { + * continue. No processes lost + * } + * if( an application process fails ) { + * abort migration. Might be a bad checkpoint, or a process that we were + * not migrating that died. + * } + * else { + * continue; + * } + */ + if( proc_name->jobid == ORTE_PROC_MY_NAME->jobid ) { + errmgr_crmig_process_fault_daemon(jdata, proc_name, state, stack_state); + } else { + errmgr_crmig_process_fault_app(jdata, proc_name, state, stack_state); + } + + return ORTE_SUCCESS; +} + +int orte_errmgr_crmig_global_suggest_map_targets(orte_proc_t *proc, + orte_node_t *oldnode, + opal_list_t *node_list, + orte_errmgr_stack_state_t *stack_state) +{ + int exit_status = ORTE_SUCCESS; + opal_list_item_t *item = NULL, *m_item = NULL; + orte_errmgr_predicted_map_t *onto_map = NULL, *current_proc_map = NULL; + orte_node_t *node = NULL; + bool found = false; + int num_suggested = 0; + orte_std_cntr_t i_proc; + orte_proc_t *peer_proc = NULL; + + /* + * If not migrating, then suggest nothing + */ + if( !migrating_underway ) { + return ORTE_SUCCESS; + } + + /* + * First look for an exclusive mapping for this process + */ + for(item = opal_list_get_first(current_onto_mapping_exclusive); + item != opal_list_get_end(current_onto_mapping_exclusive); + item = opal_list_get_next(item) ) { + onto_map = (orte_errmgr_predicted_map_t*) item; + if( onto_map->proc_name.vpid == proc->name.vpid ) { + current_proc_map = onto_map; + break; + } + } + + /* + * If there is an exclusive mapping then... + */ + if( NULL != current_proc_map ) { + /* + * If we made an exclusive mapping during the check_and_pre_map() + * then honor it here. 
+ */ + if( NULL != current_proc_map->pre_map_fixed_node ) { + for( item = opal_list_get_first(node_list); + item != opal_list_get_end(node_list); + item = opal_list_get_next(item) ) { + node = (orte_node_t*)item; + + /* Exclude all other nodes */ + found = false; + + if( 0 == strncmp(node->name, current_proc_map->pre_map_fixed_node, + strlen(current_proc_map->pre_map_fixed_node)) ) { + found = true; + break; + } + if( !found ) { + opal_list_remove_item(node_list, item); + OBJ_RELEASE(item); + continue; + } else { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- Fixed use of node [%15s : %10s -> %10s (%10s)] -------", + ORTE_NAME_PRINT(&proc->name), oldnode->name, + current_proc_map->pre_map_fixed_node, node->name)); + } + } + + /* All done with mapping */ + exit_status = ORTE_SUCCESS; + goto cleanup; + } + + /* + * If 'off_current_node' then exclude current node + */ + if( current_proc_map->off_current_node ) { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- Remove old node (info) [%15s : %10s] -------", + ORTE_NAME_PRINT(&proc->name), oldnode->name)); + for( item = opal_list_get_first(node_list); + item != opal_list_get_end(node_list); + item = opal_list_get_next(item) ) { + node = (orte_node_t*)item; + + /* Exclude the old node */ + if( node == oldnode ) { + opal_list_remove_item(node_list, item); + OBJ_RELEASE(item); + break; + } + } + } + + /* + * If 'map_proc_name' then map to the node where this process resides + * Note: Only do this if there was no 'other' node suggested. If there + * was an 'other' node suggested then we need to honor that before + * we honor the peer suggestion. 
+ */ + if( ORTE_VPID_INVALID != current_proc_map->map_proc_name.vpid && + current_proc_map->proc_name.vpid != current_proc_map->map_proc_name.vpid && + NULL == current_proc_map->map_node_name ) { + /* + * Find the node containing the target process + */ + for(i_proc = 0; i_proc < opal_pointer_array_get_size(current_global_jobdata->procs); ++i_proc) { + peer_proc = (orte_proc_t*)opal_pointer_array_get_item(current_global_jobdata->procs, i_proc); + if( NULL == peer_proc ) { + continue; + } + if( peer_proc->name.vpid == current_proc_map->map_proc_name.vpid ) { + current_proc_map->map_node_name = strdup(peer_proc->node->name); + break; + } + } + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- Force use of node with proc [%15s -> %15s: %10s -> %10s] -------", + ORTE_NAME_PRINT(&proc->name), ORTE_NAME_PRINT(&peer_proc->name), + oldnode->name, current_proc_map->map_node_name)); + } + + /* + * If 'map_node_name' then use this node exclusively + */ + if( NULL != current_proc_map->map_node_name ) { + for( item = opal_list_get_first(node_list); + item != opal_list_get_end(node_list); + item = opal_list_get_next(item) ) { + node = (orte_node_t*)item; + + /* Exclude all nodes not in the include list */ + found = false; + + if( 0 == strncmp(node->name, current_proc_map->map_node_name, strlen(current_proc_map->map_node_name)) ) { + found = true; + } + if( !found ) { + opal_list_remove_item(node_list, item); + OBJ_RELEASE(item); + continue; + } else { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- Force use of node [%15s : %10s -> %10s (%10s)] -------", + ORTE_NAME_PRINT(&proc->name), oldnode->name, + current_proc_map->map_node_name, node->name)); + } + } + + /* All done with mapping */ + exit_status = ORTE_SUCCESS; + goto cleanup; + } + + /* + * Otherwise, map as if there were no exclusive mapping + */ + OPAL_OUTPUT_VERBOSE((10,
mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- Suggesting as if non-exclusive [%15s : 0x%x : %10s] -------", + ORTE_NAME_PRINT(&proc->name), proc->state, oldnode->name)); + } + /* + * If no exclusive mapping (or exclusive did not yield any results) then... + */ + else { + /* + * Remove the old node from the list if more than one node is available + */ + if(1 < opal_list_get_size(node_list) ) { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- Remove old node [%15s : %10s] -------", + ORTE_NAME_PRINT(&proc->name), oldnode->name)); + for( item = opal_list_get_first(node_list); + item != opal_list_get_end(node_list); + item = opal_list_get_next(item) ) { + node = (orte_node_t*)item; + + /* Exclude the old node */ + if( node == oldnode ) { + opal_list_remove_item(node_list, item); + OBJ_RELEASE(item); + break; + } + } + } + } + + /* + * If we do not have any general suggestions, then just return + */ + if( opal_list_get_size(current_onto_mapping_general) <= 0 ) { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- No suggestions for target [%15s : 0x%x : %10s] -------", + ORTE_NAME_PRINT(&proc->name), proc->state, oldnode->name)); + exit_status = ORTE_SUCCESS; + goto cleanup; + } + + /* + * Otherwise look through the general suggestions as an include list + */ + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- Suggest a target for [%15s : 0x%x : %10s] -------", + ORTE_NAME_PRINT(&proc->name), proc->state, oldnode->name)); + + num_suggested = 0; + for( item = opal_list_get_first(node_list); + item != opal_list_get_end(node_list); + item = opal_list_get_next(item) ) { + node = (orte_node_t*)item; + + /* Exclude all nodes not in the include list */ + found = false; + + for(m_item = opal_list_get_first(current_onto_mapping_general); + m_item !=
opal_list_get_end(current_onto_mapping_general); + m_item = opal_list_get_next(m_item) ) { + onto_map = (orte_errmgr_predicted_map_t*) m_item; + + if( 0 == strncmp(node->name, onto_map->map_node_name, strlen(onto_map->map_node_name)) ) { + found = true; + break; + } + } + if( !found ) { + opal_list_remove_item(node_list, item); + OBJ_RELEASE(item); + continue; + } + + ++num_suggested; + + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- Suggesting target %2d [%15s : 0x%x : %10s -> %10s] -------", + num_suggested, ORTE_NAME_PRINT(&proc->name), proc->state, oldnode->name, node->name)); + } + + cleanup: + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:suggest() ------- Suggested %2d nodes for [%15s : 0x%x : %10s] -------", + (int)opal_list_get_size(node_list), ORTE_NAME_PRINT(&proc->name), proc->state, oldnode->name)); + + return exit_status; +} + +int orte_errmgr_crmig_global_ft_event(int state) +{ + return ORTE_SUCCESS; +} + +/************************ + * Function Definitions: Local + ************************/ +int orte_errmgr_crmig_local_module_init(void) +{ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: init() (Local)"); + + migrating_underway = false; + current_global_jobid = ORTE_JOBID_INVALID; + current_global_jobdata = NULL; + + return ORTE_SUCCESS; +} + +int orte_errmgr_crmig_local_module_finalize(void) +{ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig: finalize() (Local)"); + + migrating_underway = false; + current_global_jobid = ORTE_JOBID_INVALID; + current_global_jobdata = NULL; + + return ORTE_SUCCESS; +} + +int orte_errmgr_crmig_local_update_state(orte_jobid_t job, + orte_job_state_t jobstate, + orte_process_name_t *proc_name, + orte_proc_state_t state, + pid_t pid, + orte_exit_code_t exit_code, + orte_errmgr_stack_state_t *stack_state) +{ + /* + * If this component is
enabled, then the global version takes care of + * recovery policy. Tell lower layers in the ErrMgr stack -not- to recover + * locally by clearing the JOB_ABORT bit and marking the state as recovered. + */ + if( ORTE_PROC_STATE_KILLED_BY_CMD == state ) { + *stack_state &= ~ORTE_ERRMGR_STACK_STATE_JOB_ABORT; + *stack_state |= ORTE_ERRMGR_STACK_STATE_RECOVERED; + } + + OPAL_OUTPUT_VERBOSE((1, orte_errmgr_base.output, + "%s errmgr:crmig: update_state() (Local) job state %s" + " for proc %s state %s", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + orte_job_state_to_str(jobstate), + (NULL == proc_name) ? "NULL" : ORTE_NAME_PRINT(proc_name), + orte_proc_state_to_str(state) )); + + return ORTE_SUCCESS; +} + +int orte_errmgr_crmig_local_ft_event(int state) +{ + return ORTE_SUCCESS; +} + + + +static int errmgr_crmig_global_migrate(opal_list_t *off_procs, opal_list_t *off_nodes, opal_list_t *onto_maps) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_std_cntr_t i_node; + orte_std_cntr_t i_proc; + orte_node_t *node = NULL; + orte_proc_t *proc = NULL; + bool found = false; + orte_snapc_base_quiesce_t *cur_datum = NULL; + bool close_iof_stdin = false; + orte_process_name_t iof_name = {ORTE_JOBID_INVALID, 0}; + char * err_str_procs = NULL; + char * err_str_nodes = NULL; + char * tmp_str = NULL; + orte_errmgr_predicted_proc_t *off_proc = NULL; + orte_errmgr_predicted_node_t *off_node = NULL; + orte_errmgr_predicted_map_t *onto_map = NULL; + opal_list_item_t *item = NULL; + + ERRMGR_CRMIG_CLEAR_TIMERS(); + ERRMGR_CRMIG_SET_TIMER(ERRMGR_CRMIG_TIMER_START); + + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Migrating (%3d, %3d, %3d) -------", + (int)opal_list_get_size(off_procs), + (int)opal_list_get_size(off_nodes), + (int)opal_list_get_size(onto_maps))); + + /* + * Modeled after orte_plm_base_reset_job + */ + cur_datum = OBJ_NEW(orte_snapc_base_quiesce_t); + cur_datum->migrating = true; + migrating_underway = true; + + current_migration_status = ORTE_ERRMGR_MIGRATE_STATE_RUNNING; + if(
ORTE_SUCCESS != (ret = orte_errmgr_base_migrate_update(current_migration_status)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Check to make sure that the 'off' and 'onto' nodes exist + * - if 'onto' nodes do not, then add them (JJH XXX) + * - if 'off' nodes do not, then return an error (JJH XXX) + * JJH TODO... + */ + + /* + * Copy over the onto_nodes so we can suggest them later + */ + if( NULL != current_onto_mapping_general ) { + OBJ_RELEASE(current_onto_mapping_general); + current_onto_mapping_general = NULL; + } + if( NULL != current_onto_mapping_exclusive ) { + OBJ_RELEASE(current_onto_mapping_exclusive); + current_onto_mapping_exclusive = NULL; + } + current_onto_mapping_general = OBJ_NEW(opal_list_t); + current_onto_mapping_exclusive = OBJ_NEW(opal_list_t); + if( NULL != onto_maps ) { + while( NULL != (item = opal_list_remove_first(onto_maps)) ) { + onto_map = (orte_errmgr_predicted_map_t*) item; + /* Determine whether this is a process-exclusive mapping or a general one */ + if( onto_map->proc_name.vpid == ORTE_VPID_INVALID ) { + opal_list_append(current_onto_mapping_general, item); + } else { + opal_list_append(current_onto_mapping_exclusive, item); + } + } + } + + for(item = opal_list_get_first(current_onto_mapping_exclusive); + item != opal_list_get_end(current_onto_mapping_exclusive); + item = opal_list_get_next(item) ) { + onto_map = (orte_errmgr_predicted_map_t*) item; + /* + * Find the node currently containing this process + */ + found = false; + for(i_proc = 0; i_proc < opal_pointer_array_get_size(current_global_jobdata->procs); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(current_global_jobdata->procs, i_proc); + if( NULL == proc ) { + continue; + } + + if( proc->name.vpid == onto_map->proc_name.vpid) { + found = true; + break; + } + } + + /* + * Check to see if this process should be skipped + */ + if( !onto_map->off_current_node && + (ORTE_VPID_INVALID == onto_map->map_proc_name.vpid || + onto_map->proc_name.vpid
== onto_map->map_proc_name.vpid ) && + (NULL == onto_map->map_node_name || + 0 == strncmp(onto_map->map_node_name, proc->node->name, strlen(proc->node->name))) ) { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Process %15s does not wish to move -------", + ORTE_NAME_PRINT(&proc->name))); + + } else { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Process %15s will be moved -------", + ORTE_NAME_PRINT(&proc->name))); + /* + * Set the process to restarting + */ + proc->state = ORTE_PROC_STATE_MIGRATING; + + opal_pointer_array_add(&(cur_datum->migrating_procs), (void*)proc); + OBJ_RETAIN(proc); + (cur_datum->num_migrating)++; + + if( current_global_jobdata->stdin_target == proc->name.vpid ) { + close_iof_stdin = true; + iof_name.jobid = proc->name.jobid; + iof_name.vpid = proc->name.vpid; + } + } + } + + migrating_terminated = false; + migrating_restarted = false; + + /* + * Create a list of processes to migrate, if 'off_nodes' specified + */ + for(item = opal_list_get_first(off_nodes); + item != opal_list_get_end(off_nodes); + item = opal_list_get_next(item) ) { + off_node = (orte_errmgr_predicted_node_t*)item; + + /* + * Find the node in the job structure + * - Make sure that 'odin00' doesn't match all 'odin00*' + */ + found = false; + for(i_node = 0; i_node < opal_pointer_array_get_size(current_global_jobdata->map->nodes); ++i_node) { + node = (orte_node_t*)opal_pointer_array_get_item(current_global_jobdata->map->nodes, i_node); + if( NULL == node ) { + continue; + } + + if( 0 == strncmp(node->name, off_node->node_name, strlen(off_node->node_name)) ) { + found = true; + break; + } + } + if( !found ) { + ; /* Warn about invalid node */ + } else { + /* + * Add all processes from this node + */ + for(i_proc = 0; i_proc < opal_pointer_array_get_size(node->procs); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(node->procs, i_proc); 
+ if( NULL == proc ) { + continue; + } + + /* + * Set the process to restarting + */ + proc->state = ORTE_PROC_STATE_MIGRATING; + + opal_pointer_array_add(&(cur_datum->migrating_procs), (void*)proc); + OBJ_RETAIN(proc); + (cur_datum->num_migrating)++; + + if( current_global_jobdata->stdin_target == proc->name.vpid ) { + close_iof_stdin = true; + iof_name.jobid = proc->name.jobid; + iof_name.vpid = proc->name.vpid; + } + } + } + } + + /* + * Create a list of processes to migrate, if 'off_procs' specified + */ + for(item = opal_list_get_first(off_procs); + item != opal_list_get_end(off_procs); + item = opal_list_get_next(item) ) { + off_proc = (orte_errmgr_predicted_proc_t*)item; + + /* + * Find the process in the job structure + */ + found = false; + for(i_proc = 0; i_proc < opal_pointer_array_get_size(current_global_jobdata->procs); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(current_global_jobdata->procs, i_proc); + if( NULL == proc ) { + continue; + } + + if( proc->name.vpid == off_proc->proc_name.vpid) { + found = true; + break; + } + } + /* + * Make sure the process is not listed multiple times + */ + if( found ) { + found = check_if_duplicate_proc(proc, &(cur_datum->migrating_procs)); + if( !found ) { + /* + * Set the process to restarting + */ + proc->state = ORTE_PROC_STATE_MIGRATING; + + opal_pointer_array_add(&(cur_datum->migrating_procs), (void*)proc); + OBJ_RETAIN(proc); + (cur_datum->num_migrating)++; + + if( current_global_jobdata->stdin_target == proc->name.vpid ) { + close_iof_stdin = true; + iof_name.jobid = proc->name.jobid; + iof_name.vpid = proc->name.vpid; + } + } + } + } + + /* + * If we did not find any processes to migrate, then throw a warning, and skip it. 
+ */ + if( 0 >= cur_datum->num_migrating ) { + for(item = opal_list_get_first(off_nodes); + item != opal_list_get_end(off_nodes); + item = opal_list_get_next(item) ) { + off_node = (orte_errmgr_predicted_node_t*)item; + if( NULL != err_str_nodes ) { + asprintf(&tmp_str, "%s, %s", err_str_nodes, off_node->node_name); + free(err_str_nodes); + err_str_nodes = strdup(tmp_str); + free(tmp_str); + tmp_str = NULL; + } else { + asprintf(&err_str_nodes, "%s", off_node->node_name); + } + } + + for(item = opal_list_get_first(off_procs); + item != opal_list_get_end(off_procs); + item = opal_list_get_next(item) ) { + off_proc = (orte_errmgr_predicted_proc_t*)item; + if( NULL != err_str_procs ) { + asprintf(&tmp_str, "%s, %d", err_str_procs, (int)off_proc->proc_name.vpid); + free(err_str_procs); + err_str_procs = strdup(tmp_str); + free(tmp_str); + tmp_str = NULL; + } else { + asprintf(&err_str_procs, "%d", (int)off_proc->proc_name.vpid); + } + } + + opal_show_help("help-orte-errmgr-crmig.txt", "no_migrating_procs", true, + err_str_nodes, + err_str_procs); + + current_migration_status = ORTE_ERRMGR_MIGRATE_STATE_ERROR; + if( ORTE_SUCCESS != (ret = orte_errmgr_base_migrate_update(current_migration_status)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + goto cleanup; + } + + /* + * Final pass on the migration list to pre-map processes and remove + * processes that should not be migrated. + */ + if( ORTE_SUCCESS != (ret = check_and_pre_map(off_procs, off_nodes, cur_datum)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Display the request before processing it.
+ */ + display_request(off_procs, off_nodes, cur_datum); + + ERRMGR_CRMIG_SET_TIMER(ERRMGR_CRMIG_TIMER_SETUP); + + /* + * Checkpoint the job + * - Hold all non-migrating processes + * - Abort the marked processes + * - + */ + current_migration_status = ORTE_ERRMGR_MIGRATE_STATE_RUN_CKPT; + if( ORTE_SUCCESS != (ret = orte_errmgr_base_migrate_update(current_migration_status)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Starting the checkpoint of job %s -------", + ORTE_JOBID_PRINT(current_global_jobdata->jobid)); + + if( ORTE_SUCCESS != (ret = orte_snapc.start_ckpt(cur_datum)) ) { + opal_output(0, "errmgr:crmig:migrate() Error: Unable to start the checkpoint."); + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + ERRMGR_CRMIG_SET_TIMER(ERRMGR_CRMIG_TIMER_CKPT); + + /* + * Terminate the migrating processes + */ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Terminate old processes in job %s -------", + ORTE_JOBID_PRINT(current_global_jobdata->jobid)); + + orte_plm.terminate_procs(&cur_datum->migrating_procs); + + /* + * Clear the IOF stdin target if necessary + */ + if( close_iof_stdin ) { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Closing old STDIN target for job %s (%s)-------", + ORTE_JOBID_PRINT(current_global_jobdata->jobid), + ORTE_NAME_PRINT(&iof_name) )); + + orte_iof.close(&iof_name, ORTE_IOF_STDIN); + } + + /* + * Wait for the processes to finish terminating + */ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Waiting for termination -------"); + + while( !migrating_terminated ) { + opal_progress(); + check_if_terminated(&(cur_datum->migrating_procs)); + } + + ERRMGR_CRMIG_SET_TIMER(ERRMGR_CRMIG_TIMER_TERM); + + /* + * Start 
remapping the processes + */ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Checkpoint finished, setting up job %s -------", + ORTE_JOBID_PRINT(current_global_jobdata->jobid)); + + current_migration_status = ORTE_ERRMGR_MIGRATE_STATE_STARTUP; + if( ORTE_SUCCESS != (ret = orte_errmgr_base_migrate_update(current_migration_status)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Reset the job parameters for restart + * This will set the state of the job to 'restart' + */ + orte_plm_base_reset_job(current_global_jobdata); + + /* + * Adjust the application context information + */ + for(i_proc = 0; i_proc < opal_pointer_array_get_size(&(cur_datum->migrating_procs)); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(&(cur_datum->migrating_procs), i_proc); + if( NULL == proc ) { + continue; + } + + if( ORTE_SUCCESS != (ret = orte_errmgr_base_update_app_context_for_cr_recovery(current_global_jobdata, + proc, + &(cur_datum->ss_snapshot->local_snapshots))) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "\tAdjusted: \"%s\" [0x%x] [%s]\n", + ORTE_NAME_PRINT(&proc->name), proc->state, proc->node->name)); + } + + ERRMGR_CRMIG_SET_TIMER(ERRMGR_CRMIG_TIMER_RESETUP); + + /* + * Restart the job + * - spawn function will remap and launch the replacement proc(s) + */ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Respawning migrating processes in job %s -------", + ORTE_JOBID_PRINT(current_global_jobdata->jobid)); + + orte_plm.spawn(current_global_jobdata); + + + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Waiting for restart -------"); + + migrating_restarted = false; + while( !migrating_restarted ) { + opal_progress(); +
check_if_restarted(&(cur_datum->migrating_procs)); + } + + ERRMGR_CRMIG_SET_TIMER(ERRMGR_CRMIG_TIMER_RESTART); + + /* + * Finish the checkpoint + */ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Reconnecting processes in job %s -------", + ORTE_JOBID_PRINT(current_global_jobdata->jobid)); + + if( ORTE_SUCCESS != (ret = orte_snapc.end_ckpt(cur_datum)) ) { + opal_output(0, "errmgr:crmig:migrate() Error: Unable to end the checkpoint."); + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * All done + */ + opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() ------- Finished migrating processes in job %s -------", + ORTE_JOBID_PRINT(current_global_jobdata->jobid)); + + OBJ_RELEASE(cur_datum); + + current_migration_status = ORTE_ERRMGR_MIGRATE_STATE_FINISH; + if( ORTE_SUCCESS != (ret = orte_errmgr_base_migrate_update(current_migration_status)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + ERRMGR_CRMIG_SET_TIMER(ERRMGR_CRMIG_TIMER_FINISH); + ERRMGR_CRMIG_DISPLAY_ALL_TIMERS(); + + cleanup: + migrating_underway = false; + migrating_terminated = false; + migrating_restarted = false; + + if( NULL != err_str_procs ) { + free(err_str_procs); + err_str_procs = NULL; + } + + if( NULL != err_str_nodes ) { + free(err_str_nodes); + err_str_nodes = NULL; + } + + return exit_status; +} + +static bool check_if_duplicate_proc(orte_proc_t *proc, opal_pointer_array_t *migrating_procs) +{ + orte_std_cntr_t i_proc; + orte_proc_t *loc_proc = NULL; + + for(i_proc = 0; i_proc < opal_pointer_array_get_size(migrating_procs); ++i_proc) { + loc_proc = (orte_proc_t*)opal_pointer_array_get_item(migrating_procs, i_proc); + if( NULL == loc_proc ) { + continue; + } + if( loc_proc->name.vpid == proc->name.vpid ) { + return true; + } + } + + return false; +} + +static int check_if_terminated(opal_pointer_array_t *migrating_procs) +{ + orte_std_cntr_t 
i_proc; + orte_proc_t *proc = NULL; + bool is_done; + + is_done = true; + for(i_proc = 0; i_proc < opal_pointer_array_get_size(migrating_procs); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(migrating_procs, i_proc); + if( NULL == proc ) { + continue; + } + + if( !(ORTE_PROC_STATE_KILLED_BY_CMD & proc->state) ) { + is_done = false; + break; + } + } + + if( is_done ) { + migrating_terminated = true; + } + else { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "\t Still waiting for termination: \"%s\" [0x%x] != [0x%x]\n", + ORTE_NAME_PRINT(&proc->name), proc->state, ORTE_PROC_STATE_KILLED_BY_CMD)); + } + + return ORTE_SUCCESS; +} + +static int check_if_restarted(opal_pointer_array_t *migrating_procs) +{ + orte_std_cntr_t i_proc; + orte_proc_t *proc = NULL; + bool is_done; + + is_done = true; + for(i_proc = 0; i_proc < opal_pointer_array_get_size(migrating_procs); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(migrating_procs, i_proc); + if( NULL == proc ) { + continue; + } + + /* proc->state != ORTE_PROC_STATE_LAUNCHED */ + if( !(ORTE_PROC_STATE_RUNNING & proc->state) ) { + is_done = false; + break; + } + } + + if( is_done ) { + migrating_restarted = true; + } + else { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "\tStill waiting for restart: \"%s\" [0x%x] != [0x%x]\n", + ORTE_NAME_PRINT(&proc->name), proc->state, ORTE_PROC_STATE_RUNNING)); + } + + return ORTE_SUCCESS; +} + +static void errmgr_crmig_process_fault_app(orte_job_t *jdata, + orte_process_name_t *proc, + orte_proc_state_t state, + orte_errmgr_stack_state_t *stack_state) +{ + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:process_fault_app() " + "------- Application fault reported! proc %s (0x%x) " + "- %s", + ORTE_NAME_PRINT(proc), + state, + (migrating_underway ? 
"Migrating" : "Not Migrating") )); + + return; +} + +static void errmgr_crmig_process_fault_daemon(orte_job_t *jdata, + orte_process_name_t *proc, + orte_proc_state_t state, + orte_errmgr_stack_state_t *stack_state) +{ + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:process_fault_daemon() " + "------- Daemon fault reported! proc %s (0x%x) " + "- %s", + ORTE_NAME_PRINT(proc), + state, + (migrating_underway ? "Migrating" : "Not Migrating") )); + + /* + * Failed communication can be ignored for the most part. + * Make sure to remove the route + * JJH: Check to make sure this is not a new daemon loss. + */ + if( ORTE_PROC_STATE_COMM_FAILED == state ) { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:process_fault_daemon() " + "------- Daemon fault reported! proc %s (0x%x) " + "- Communication failure, keep going", + ORTE_NAME_PRINT(proc), + state )); + } + + return; +} + +static int check_and_pre_map(opal_list_t *off_procs, + opal_list_t *off_nodes, + orte_snapc_base_quiesce_t *cur_datum) +{ + /* + * Check the 'off_procs' list for processes that should not be migrated + */ + + /* + * Check the 'current_onto_mapping_exclusive' for processes that are moving + * 'near/with' other processes that are also moving. Be sure to watch out + * for circular deadlock. + */ + + /* + * Use the 'pre_map_fixed_node' structure to fix this process' mapping. 
+ */ + + return ORTE_SUCCESS; +} + +static void display_request(opal_list_t *off_procs, + opal_list_t *off_nodes, + orte_snapc_base_quiesce_t *cur_datum) +{ + orte_std_cntr_t i_node; + orte_std_cntr_t i_proc; + orte_node_t *node = NULL; + orte_proc_t *proc = NULL; + bool found = false; + char * status_str = NULL; + char * tmp_str = NULL; + orte_errmgr_predicted_proc_t *off_proc = NULL; + orte_errmgr_predicted_node_t *off_node = NULL; + orte_errmgr_predicted_map_t *onto_map = NULL; + opal_list_item_t *item = NULL; + + /* + * Display all requested processes to migrate + */ + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() Requested Processes to migrate: (%d procs)\n", + (int) opal_list_get_size(off_procs) )); + for(item = opal_list_get_first(off_procs); + item != opal_list_get_end(off_procs); + item = opal_list_get_next(item) ) { + off_proc = (orte_errmgr_predicted_proc_t*)item; + + /* + * Find the process in the job structure + */ + found = false; + for(i_proc = 0; i_proc < opal_pointer_array_get_size(current_global_jobdata->procs); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(current_global_jobdata->procs, i_proc); + if( NULL == proc ) { + continue; + } + + if( proc->name.vpid == off_proc->proc_name.vpid) { + found = true; + break; + } + } + if( found ) { OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "\t%s (Rank %3d) on node %s\n", + ORTE_NAME_PRINT(&proc->name), (int)off_proc->proc_name.vpid, proc->node->name)); } + } + + /* + * Display Off Nodes + */ + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() Requested nodes to migrate: (%d nodes)\n", + (int)opal_list_get_size(off_nodes) )); + + for(item = opal_list_get_first(off_nodes); + item != opal_list_get_end(off_nodes); + item = opal_list_get_next(item) ) { + off_node = (orte_errmgr_predicted_node_t*)item; + + for(i_node = 0; i_node <
opal_pointer_array_get_size(current_global_jobdata->map->nodes); ++i_node) { + node = (orte_node_t*)opal_pointer_array_get_item(current_global_jobdata->map->nodes, i_node); + if( NULL == node ) { + continue; + } + + found = false; + if( 0 == strncmp(node->name, off_node->node_name, strlen(off_node->node_name)) ) { + found = true; + break; + } + } + if( found ) { + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "\t\"%s\" \t%d\n", + node->name, node->num_procs)); + for(i_proc = 0; i_proc < opal_pointer_array_get_size(node->procs); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(node->procs, i_proc); + if( NULL == proc ) { + continue; + } + + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "\t\t\"%s\" [0x%x]\n", + ORTE_NAME_PRINT(&proc->name), proc->state)); + } + } + } + + /* + * Suggested onto nodes + */ + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() Suggested nodes to migrate onto: (%d nodes)\n", + (int)opal_list_get_size(current_onto_mapping_general) )); + for(item = opal_list_get_first(current_onto_mapping_general); + item != opal_list_get_end(current_onto_mapping_general); + item = opal_list_get_next(item) ) { + onto_map = (orte_errmgr_predicted_map_t*) item; + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "\t\"%s\"\n", + onto_map->map_node_name)); + } + + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() Suggested nodes to migrate onto (exclusive): (%d nodes)\n", + (int)opal_list_get_size(current_onto_mapping_exclusive) )); + for(item = opal_list_get_first(current_onto_mapping_exclusive); + item != opal_list_get_end(current_onto_mapping_exclusive); + item = opal_list_get_next(item) ) { + onto_map = (orte_errmgr_predicted_map_t*) item; + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "\t%d\t(%c)\t\"%s\"\n", +
onto_map->proc_name.vpid, + (onto_map->off_current_node ? 'T' : 'F'), + onto_map->map_node_name)); + } + + /* + * Display all processes scheduled to migrate + */ + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "errmgr:crmig:migrate() All Migrating Processes: (%d procs)\n", + cur_datum->num_migrating)); + for(i_proc = 0; i_proc < opal_pointer_array_get_size(&(cur_datum->migrating_procs)); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(&(cur_datum->migrating_procs), i_proc); + if( NULL == proc ) { + continue; + } + + OPAL_OUTPUT_VERBOSE((10, mca_errmgr_crmig_component.super.output_handle, + "\t\"%s\" [0x%x] [%s]\n", + ORTE_NAME_PRINT(&proc->name), proc->state, proc->node->name)); + + if( NULL == status_str ) { + asprintf(&status_str, "\t%s Rank %d on Node %s\n", + ORTE_NAME_PRINT(&proc->name), + (int)proc->name.vpid, + proc->node->name); + } else { + tmp_str = strdup(status_str); + free(status_str); + status_str = NULL; + asprintf(&status_str, "%s\t%s Rank %d on Node %s\n", + tmp_str, + ORTE_NAME_PRINT(&proc->name), + (int)proc->name.vpid, + proc->node->name); + } + } + + opal_show_help("help-orte-errmgr-crmig.txt", "migrating_job", true, + status_str); + + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + if( NULL != status_str ) { + free(status_str); + status_str = NULL; + } + + return; +} + +static void update_proc(orte_job_t *jdata, + orte_process_name_t *proc, + orte_proc_state_t state, + orte_exit_code_t exit_code) +{ + opal_list_item_t *item, *next; + orte_odls_child_t *child; + orte_proc_t *proct; + int i; + + /*** UPDATE LOCAL CHILD ***/ + for (item = opal_list_get_first(&orte_local_children); + item != opal_list_get_end(&orte_local_children); + item = next) { + next = opal_list_get_next(item); + child = (orte_odls_child_t*)item; + if (child->name->jobid == proc->jobid) { + if (child->name->vpid == proc->vpid) { + child->state = state; + child->exit_code = exit_code; + proct = 
(orte_proc_t*)opal_pointer_array_get_item(jdata->procs, child->name->vpid); + proct->state = state; + proct->exit_code = exit_code; + /* (JJH: See note below) + if (ORTE_PROC_STATE_UNTERMINATED < state) { + jdata->num_terminated++; + } + */ + return; + } + } + } + + /*** UPDATE REMOTE CHILD ***/ + for (i=0; i < jdata->procs->size; i++) { + if (NULL == (proct = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, i))) { + continue; + } + if (proct->name.jobid != proc->jobid || + proct->name.vpid != proc->vpid) { + continue; + } + proct->state = state; + proct->exit_code = exit_code; + if (ORTE_PROC_STATE_UNTERMINATED < state) { + /* JJH: Do not increment this value. Otherwise the 'hnp' component + * will try to terminate us after we request the job to + * terminate. So we fake it out by making sure that + * num_terminated never equals num_procs. + * There should be a better way though... + */ + /* update the counter so we can terminate */ + /*jdata->num_terminated++;*/ + } + return; + } +} + +/************************ + * Timing + ************************/ +static void errmgr_crmig_set_time(int idx) +{ + if(idx < ERRMGR_CRMIG_TIMER_MAX ) { + if( timer_start[idx] <= 0.0 ) { + timer_start[idx] = errmgr_crmig_get_time(); + } + } +} + +static void errmgr_crmig_display_all_timers(void) +{ + double diff = 0.0; + char * label = NULL; + + opal_output(0, "Process Migration Timing: ******************** Summary Begin\n"); + + /********** Structure Setup **********/ + label = strdup("Setup"); + diff = timer_start[ERRMGR_CRMIG_TIMER_SETUP] - timer_start[ERRMGR_CRMIG_TIMER_START]; + errmgr_crmig_display_indv_timer_core(diff, label); + free(label); + + /********** Checkpoint **********/ + label = strdup("Checkpoint"); + diff = timer_start[ERRMGR_CRMIG_TIMER_CKPT] - timer_start[ERRMGR_CRMIG_TIMER_SETUP]; + errmgr_crmig_display_indv_timer_core(diff, label); + free(label); + + /********** Termination **********/ + label = strdup("Terminate"); + diff =
timer_start[ERRMGR_CRMIG_TIMER_TERM] - timer_start[ERRMGR_CRMIG_TIMER_CKPT]; + errmgr_crmig_display_indv_timer_core(diff, label); + free(label); + + /********** Setup new job **********/ + label = strdup("Setup Relaunch"); + diff = timer_start[ERRMGR_CRMIG_TIMER_RESETUP] - timer_start[ERRMGR_CRMIG_TIMER_TERM]; + errmgr_crmig_display_indv_timer_core(diff, label); + free(label); + + /********** Restart **********/ + label = strdup("Restart"); + diff = timer_start[ERRMGR_CRMIG_TIMER_RESTART] - timer_start[ERRMGR_CRMIG_TIMER_RESETUP]; + errmgr_crmig_display_indv_timer_core(diff, label); + free(label); + + /********** Finish **********/ + label = strdup("Finalize"); + diff = timer_start[ERRMGR_CRMIG_TIMER_FINISH] - timer_start[ERRMGR_CRMIG_TIMER_RESTART]; + errmgr_crmig_display_indv_timer_core(diff, label); + free(label); + + opal_output(0, "Process Migration Timing: ******************** Summary End\n"); +} + +static void errmgr_crmig_clear_timers(void) +{ + int i; + for(i = 0; i < ERRMGR_CRMIG_TIMER_MAX; ++i) { + timer_start[i] = 0.0; + } +} + +static double errmgr_crmig_get_time(void) +{ + double wtime; + +#if OPAL_TIMER_USEC_NATIVE + wtime = (double)opal_timer_base_get_usec() / 1000000.0; +#else + struct timeval tv; + gettimeofday(&tv, NULL); + wtime = tv.tv_sec; + wtime += (double)tv.tv_usec / 1000000.0; +#endif + + return wtime; +} + +static void errmgr_crmig_display_indv_timer_core(double diff, char *str) +{ + double total = 0; + double perc = 0; + + total = timer_start[ERRMGR_CRMIG_TIMER_MAX-1] - timer_start[ERRMGR_CRMIG_TIMER_START]; + perc = (diff/total) * 100; + + opal_output(0, + "errmgr_crmig: timing: %-20s = %10.2f s\t%10.2f s\t%6.2f\n", + str, + diff, + total, + perc); + return; +} diff --git a/orte/mca/errmgr/crmig/help-orte-errmgr-crmig.txt b/orte/mca/errmgr/crmig/help-orte-errmgr-crmig.txt new file mode 100644 index 0000000000..44f251f756 --- /dev/null +++ b/orte/mca/errmgr/crmig/help-orte-errmgr-crmig.txt @@ -0,0 +1,27 @@ + -*- text -*- +# +# Copyright 
(c) 2009-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# +# This is the US/English general help file for the ORTE ErrMgr CRMig component. +# +[migrating_job] +Notice: A migration of this job has been requested. + The processes below will be migrated. + Please stand by. +%s +[migrated_job] +Notice: The processes have been successfully migrated to/from the specified + machines. +[no_migrating_procs] +Warning: Could not find any processes to migrate on the nodes or processes specified. + You provided the following: +Nodes: %s +Procs: %s diff --git a/orte/mca/errmgr/errmgr.h b/orte/mca/errmgr/errmgr.h index d074f6f3ea..4519993cac 100644 --- a/orte/mca/errmgr/errmgr.h +++ b/orte/mca/errmgr/errmgr.h @@ -79,6 +79,70 @@ BEGIN_C_DECLS /* type definition */ typedef uint8_t orte_errmgr_stack_state_t; +/* + * Structure to describe a predicted process fault. + * + * This can be expanded in the future to support assurance levels and + * any additional information that needs to be conveyed. + */ +struct orte_errmgr_predicted_proc_t { + /** This is an object, so must have a super */ + opal_list_item_t super; + + /** Process Name */ + orte_process_name_t proc_name; +}; +typedef struct orte_errmgr_predicted_proc_t orte_errmgr_predicted_proc_t; +OBJ_CLASS_DECLARATION(orte_errmgr_predicted_proc_t); + +/* + * Structure to describe a predicted node fault. + * + * This can be expanded in the future to support assurance levels and + * any additional information that needs to be conveyed.
+ */ +struct orte_errmgr_predicted_node_t { + /** This is an object, so must have a super */ + opal_list_item_t super; + + /** Node Name */ + char * node_name; +}; +typedef struct orte_errmgr_predicted_node_t orte_errmgr_predicted_node_t; +OBJ_CLASS_DECLARATION(orte_errmgr_predicted_node_t); + +/* + * Structure to describe a suggested remapping element for a predicted fault. + * + * This can be expanded in the future to support weights and + * any additional information that needs to be conveyed. + */ +struct orte_errmgr_predicted_map_t { + /** This is an object, so must have a super */ + opal_list_item_t super; + + /** Process Name (predicted to fail) */ + orte_process_name_t proc_name; + + /** Node Name (predicted to fail) */ + char * node_name; + + /** Process Name (Map to) */ + orte_process_name_t map_proc_name; + + /** Node Name (Map to) */ + char * map_node_name; + + /** Just off current node */ + bool off_current_node; + + /** Pre-map fixed node assignment */ + char * pre_map_fixed_node; +}; +typedef struct orte_errmgr_predicted_map_t orte_errmgr_predicted_map_t; +OBJ_CLASS_DECLARATION(orte_errmgr_predicted_map_t); + + /* * Macro definitions */ @@ -129,14 +193,15 @@ typedef int (*orte_errmgr_base_API_update_state_fn_t)(orte_jobid_t job, * * @param[in] proc_list List of processes (or NULL if none) * @param[in] node_list List of nodes (or NULL if none) - * @param[in] suggested_nodes List of suggested nodes to use on recovery (or NULL if none) + * @param[in] suggested_map List of mapping suggestions to use on recovery (or NULL if none) * * @retval ORTE_SUCCESS The operation completed successfully * @retval ORTE_ERROR An unspecifed error occurred */ -typedef int (*orte_errmgr_base_API_predicted_fault_fn_t)(char ***proc_list, - char ***node_list, - char ***suggested_nodes); +typedef int (*orte_errmgr_base_API_predicted_fault_fn_t)(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map); + /** * Suggest a node to map a restarting process
onto * @@ -212,9 +277,9 @@ typedef int (*orte_errmgr_base_module_update_state_fn_t)(orte_jobid_t job, pid_t pid, orte_exit_code_t exit_code, orte_errmgr_stack_state_t *stack_state); -typedef int (*orte_errmgr_base_module_predicted_fault_fn_t)(char ***proc_list, - char ***node_list, - char ***suggested_nodes, +typedef int (*orte_errmgr_base_module_predicted_fault_fn_t)(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map, orte_errmgr_stack_state_t *stack_state); typedef int (*orte_errmgr_base_module_suggest_map_targets_fn_t)(orte_proc_t *proc, orte_node_t *oldnode, diff --git a/orte/mca/errmgr/example/.ompi_ignore b/orte/mca/errmgr/example/.ompi_ignore new file mode 100644 index 0000000000..e69de29bb2 diff --git a/orte/mca/errmgr/example/Makefile.am b/orte/mca/errmgr/example/Makefile.am new file mode 100644 index 0000000000..ce65a773a7 --- /dev/null +++ b/orte/mca/errmgr/example/Makefile.am @@ -0,0 +1,38 @@ +# +# Copyright (c) 2009-2010 The Trustees of Indiana University. +# All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +dist_pkgdata_DATA = help-orte-errmgr-example.txt + +sources = \ + errmgr_example.h \ + errmgr_example_component.c \ + errmgr_example_module.c + +# Make the output library in this directory, and name it either +# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la +# (for static builds).
+ +if OMPI_BUILD_errmgr_example_DSO +component_noinst = +component_install = mca_errmgr_example.la +else +component_noinst = libmca_errmgr_example.la +component_install = +endif + +mcacomponentdir = $(pkglibdir) +mcacomponent_LTLIBRARIES = $(component_install) +mca_errmgr_example_la_SOURCES = $(sources) +mca_errmgr_example_la_LDFLAGS = -module -avoid-version + +noinst_LTLIBRARIES = $(component_noinst) +libmca_errmgr_example_la_SOURCES = $(sources) +libmca_errmgr_example_la_LDFLAGS = -module -avoid-version diff --git a/orte/mca/errmgr/example/configure.m4 b/orte/mca/errmgr/example/configure.m4 new file mode 100644 index 0000000000..05c77e518f --- /dev/null +++ b/orte/mca/errmgr/example/configure.m4 @@ -0,0 +1,20 @@ +# -*- shell-script -*- +# +# Copyright (c) 2009-2010 The Trustees of Indiana University. +# All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +# MCA_errmgr_example_CONFIG([action-if-found], [action-if-not-found]) +# ----------------------------------------------------------- +AC_DEFUN([MCA_errmgr_example_CONFIG],[ + # If we don't want FT, don't compile this component + AS_IF([test "$opal_want_ft_cr" = "1"], + [$1], + [$2]) +])dnl diff --git a/orte/mca/errmgr/example/configure.params b/orte/mca/errmgr/example/configure.params new file mode 100644 index 0000000000..0845231495 --- /dev/null +++ b/orte/mca/errmgr/example/configure.params @@ -0,0 +1,14 @@ +# -*- shell-script -*- +# +# Copyright (c) 2009-2010 The Trustees of Indiana University. +# All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +PARAM_INIT_FILE=errmgr_example_component.c +PARAM_CONFIG_FILES="Makefile" diff --git a/orte/mca/errmgr/example/errmgr_example.h b/orte/mca/errmgr/example/errmgr_example.h new file mode 100644 index 0000000000..d0c1f8c31e --- /dev/null +++ b/orte/mca/errmgr/example/errmgr_example.h @@ -0,0 +1,74 @@ +/* + * Copyright (c) 2009-2010 The Trustees of Indiana University. 
+ * All rights reserved. + * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/** + * @file + * + * Example Errmgr component + * + */ + +#ifndef MCA_ERRMGR_EXAMPLE_EXPORT_H +#define MCA_ERRMGR_EXAMPLE_EXPORT_H + +#include "orte_config.h" + +#include "opal/mca/mca.h" +#include "opal/event/event.h" + +#include "orte/mca/filem/filem.h" +#include "orte/mca/errmgr/errmgr.h" + +BEGIN_C_DECLS + + /* + * Local Component structures + */ + struct orte_errmgr_example_component_t { + orte_errmgr_base_component_t super; /** Base Errmgr component */ + }; + typedef struct orte_errmgr_example_component_t orte_errmgr_example_component_t; + OPAL_MODULE_DECLSPEC extern orte_errmgr_example_component_t mca_errmgr_example_component; + + int orte_errmgr_example_component_query(mca_base_module_t **module, int *priority); + + /* + * Module functions: Global + */ + int orte_errmgr_example_global_module_init(void); + int orte_errmgr_example_global_module_finalize(void); + + int orte_errmgr_example_global_update_state(orte_jobid_t job, + orte_job_state_t jobstate, + orte_process_name_t *proc_name, + orte_proc_state_t state, + orte_exit_code_t exit_code, + orte_errmgr_stack_state_t *stack_state); + + int orte_errmgr_example_global_predicted_fault(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map, + orte_errmgr_stack_state_t *stack_state); + int orte_errmgr_example_global_process_fault(orte_job_t *jdata, + orte_process_name_t *proc_name, + orte_proc_state_t state, + orte_errmgr_stack_state_t *stack_state); + int orte_errmgr_example_global_suggest_map_targets(orte_proc_t *proc, + orte_node_t *oldnode, + opal_list_t *node_list, + orte_errmgr_stack_state_t *stack_state); + + int orte_errmgr_example_global_ft_event(int state); + + +END_C_DECLS + +#endif /* MCA_ERRMGR_EXAMPLE_EXPORT_H */ diff --git a/orte/mca/errmgr/example/errmgr_example_component.c b/orte/mca/errmgr/example/errmgr_example_component.c new file mode 100644
index 0000000000..43a47367e2 --- /dev/null +++ b/orte/mca/errmgr/example/errmgr_example_component.c @@ -0,0 +1,120 @@ +/* + * Copyright (c) 2009-2010 The Trustees of Indiana University. + * All rights reserved. + * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" +#include "opal/util/output.h" + +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/errmgr/base/base.h" +#include "orte/mca/errmgr/base/errmgr_private.h" +#include "errmgr_example.h" + +/* + * Public string for version number + */ +const char *orte_errmgr_example_component_version_string = + "ORTE ERRMGR Example MCA component version " ORTE_VERSION; + +/* + * Local functionality + */ +static int errmgr_example_open(void); +static int errmgr_example_close(void); + +/* + * Instantiate the public struct with all of our public information + * and pointers to our public functions in it + */ +orte_errmgr_example_component_t mca_errmgr_example_component = { + /* First do the base component stuff */ + { + /* Handle the general mca_component_t struct containing + * meta information about the component itself + */ + { + ORTE_ERRMGR_BASE_VERSION_3_0_0, + /* Component name and version */ + "example", + ORTE_MAJOR_VERSION, + ORTE_MINOR_VERSION, + ORTE_RELEASE_VERSION, + + /* Component open and close functions */ + errmgr_example_open, + errmgr_example_close, + orte_errmgr_example_component_query + }, + { + /* The component is checkpoint ready */ + MCA_BASE_METADATA_PARAM_CHECKPOINT + }, + + /* Verbosity level */ + 0, + /* opal_output handler */ + -1, + /* Default priority */ + 0 + } +}; + +static int errmgr_example_open(void) +{ + /* + * This should be the last component to ever get used since + * it doesn't do anything.
+ */ + mca_base_param_reg_int(&mca_errmgr_example_component.super.base_version, + "priority", + "Priority of the ERRMGR example component", + false, false, + mca_errmgr_example_component.super.priority, + &mca_errmgr_example_component.super.priority); + + mca_base_param_reg_int(&mca_errmgr_example_component.super.base_version, + "verbose", + "Verbose level for the ERRMGR example component", + false, false, + mca_errmgr_example_component.super.verbose, + &mca_errmgr_example_component.super.verbose); + /* If there is a custom verbose level for this component, then use it; + * otherwise take our parent's level and output channel + */ + if ( 0 != mca_errmgr_example_component.super.verbose) { + mca_errmgr_example_component.super.output_handle = opal_output_open(NULL); + opal_output_set_verbosity(mca_errmgr_example_component.super.output_handle, + mca_errmgr_example_component.super.verbose); + } else { + mca_errmgr_example_component.super.output_handle = orte_errmgr_base.output; + } + + /* + * Debug Output + */ + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + "errmgr:example: open()"); + opal_output_verbose(20, mca_errmgr_example_component.super.output_handle, + "errmgr:example: open: priority = %d", + mca_errmgr_example_component.super.priority); + opal_output_verbose(20, mca_errmgr_example_component.super.output_handle, + "errmgr:example: open: verbosity = %d", + mca_errmgr_example_component.super.verbose); + + return ORTE_SUCCESS; +} + +static int errmgr_example_close(void) +{ + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + "errmgr:example: close()"); + + return ORTE_SUCCESS; +} diff --git a/orte/mca/errmgr/example/errmgr_example_module.c b/orte/mca/errmgr/example/errmgr_example_module.c new file mode 100644 index 0000000000..87089654e2 --- /dev/null +++ b/orte/mca/errmgr/example/errmgr_example_module.c @@ -0,0 +1,187 @@ +/* + * Copyright (c) 2009-2010 The Trustees of Indiana University. + * All rights reserved.
+ * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" + +#include +#ifdef HAVE_UNISTD_H +#include +#endif /* HAVE_UNISTD_H */ +#ifdef HAVE_STRING_H +#include +#endif + +#include "opal/util/show_help.h" +#include "opal/util/output.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" +#include "opal/util/argv.h" +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" +#include "opal/mca/crs/crs.h" +#include "opal/mca/crs/base/base.h" + +#include "orte/util/error_strings.h" +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "opal/dss/dss.h" +#include "orte/mca/rml/rml.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/iof/iof.h" +#include "orte/mca/plm/plm.h" +#include "orte/mca/plm/base/base.h" +#include "orte/mca/plm/base/plm_private.h" +#include "orte/mca/filem/filem.h" +#include "orte/mca/grpcomm/grpcomm.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/rmaps/rmaps_types.h" +#include "orte/mca/snapc/snapc.h" +#include "orte/mca/snapc/base/base.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/errmgr/base/base.h" +#include "orte/mca/errmgr/base/errmgr_private.h" + +#include "errmgr_example.h" + +#include MCA_timer_IMPLEMENTATION_HEADER + + +/****************** + * Example module + ******************/ +static orte_errmgr_base_module_t global_module = { + /** Initialization Function */ + orte_errmgr_example_global_module_init, + /** Finalization Function */ + orte_errmgr_example_global_module_finalize, + /** Update State */ + orte_errmgr_example_global_update_state, + orte_errmgr_example_global_predicted_fault, + /*orte_errmgr_example_global_process_fault,*/ + orte_errmgr_example_global_suggest_map_targets, + orte_errmgr_example_global_ft_event +}; +
+/************************************ + * Locally Global vars & functions + ************************************/ + +/************************ + * Function Definitions + ************************/ +/* + * MCA Functions + */ +int orte_errmgr_example_component_query(mca_base_module_t **module, int *priority) +{ + if( !(orte_enable_recovery) ) { + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + "errmgr:example:component_query() - Disabled: Recovery is not enabled"); + + *priority = -1; + *module = NULL; + return ORTE_SUCCESS; + } + + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + "errmgr:example:component_query()"); + + *priority = mca_errmgr_example_component.super.priority; + if( ORTE_PROC_IS_HNP ) { + *module = (mca_base_module_t *)&global_module; + } + else { + *module = NULL; + } + + return ORTE_SUCCESS; +} + +/************************ + * Function Definitions + ************************/ +int orte_errmgr_example_global_module_init(void) +{ + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + "errmgr:example:init()"); + + return ORTE_SUCCESS; +} + +int orte_errmgr_example_global_module_finalize(void) +{ + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + "errmgr:example:finalize()"); + + return ORTE_SUCCESS; +} + +int orte_errmgr_example_global_predicted_fault(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map, + orte_errmgr_stack_state_t *stack_state) +{ + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + "errmgr:example:predicted_fault()"); + + return ORTE_SUCCESS; +} + +int orte_errmgr_example_global_update_state(orte_jobid_t job, + orte_job_state_t jobstate, + orte_process_name_t *proc_name, + orte_proc_state_t state, + orte_exit_code_t exit_code, + orte_errmgr_stack_state_t *stack_state) +{ + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + 
"errmgr:example:update_state(%s)", + ORTE_NAME_PRINT(proc_name)); + + return ORTE_SUCCESS; +} + +int orte_errmgr_example_global_process_fault(orte_job_t *jdata, + orte_process_name_t *proc_name, + orte_proc_state_t state, + orte_errmgr_stack_state_t *stack_state) +{ + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + "errmgr:example:process_fault(%s)", + ORTE_NAME_PRINT(proc_name)); + + return ORTE_SUCCESS; +} + +int orte_errmgr_example_global_suggest_map_targets(orte_proc_t *proc, + orte_node_t *oldnode, + opal_list_t *node_list, + orte_errmgr_stack_state_t *stack_state) +{ + opal_output_verbose(10, mca_errmgr_example_component.super.output_handle, + "errmgr:example:suggest_map_targets()"); + + return ORTE_SUCCESS; +} + +int orte_errmgr_example_global_ft_event(int state) +{ + return ORTE_SUCCESS; +} + +/***************** + * Local Functions + *****************/ diff --git a/orte/mca/errmgr/example/help-orte-errmgr-example.txt b/orte/mca/errmgr/example/help-orte-errmgr-example.txt new file mode 100644 index 0000000000..d316c87553 --- /dev/null +++ b/orte/mca/errmgr/example/help-orte-errmgr-example.txt @@ -0,0 +1,14 @@ + -*- text -*- +# +# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# +# This is the US/English general help file for ORTE ErrMgr Example framework. 
+# diff --git a/orte/mca/errmgr/hnp/errmgr_hnp.c b/orte/mca/errmgr/hnp/errmgr_hnp.c index f0b0f24f60..514c0f1bbb 100644 --- a/orte/mca/errmgr/hnp/errmgr_hnp.c +++ b/orte/mca/errmgr/hnp/errmgr_hnp.c @@ -82,9 +82,9 @@ static int update_state(orte_jobid_t job, orte_exit_code_t exit_code, orte_errmgr_stack_state_t *stack_state); -static int predicted_fault(char ***proc_list, - char ***node_list, - char ***suggested_nodes, +static int predicted_fault(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map, orte_errmgr_stack_state_t *stack_state); static int suggest_map_targets(orte_proc_t *proc, @@ -462,6 +462,7 @@ static int update_state(orte_jobid_t job, check_job_complete(jdata); break; } + /* delete the route */ orte_routed.delete_route(proc); /* purge the oob */ @@ -524,9 +525,9 @@ static int update_state(orte_jobid_t job, return ORTE_SUCCESS; } -static int predicted_fault(char ***proc_list, - char ***node_list, - char ***suggested_nodes, +static int predicted_fault(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map, orte_errmgr_stack_state_t *stack_state) { return ORTE_ERR_NOT_IMPLEMENTED; diff --git a/orte/mca/errmgr/orted/errmgr_orted.c b/orte/mca/errmgr/orted/errmgr_orted.c index 81cddb2bec..23a9d3389a 100644 --- a/orte/mca/errmgr/orted/errmgr_orted.c +++ b/orte/mca/errmgr/orted/errmgr_orted.c @@ -62,9 +62,9 @@ static void killprocs(orte_jobid_t job, orte_vpid_t vpid); static int init(void); static int finalize(void); -static int predicted_fault(char ***proc_list, - char ***node_list, - char ***suggested_nodes, +static int predicted_fault(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map, orte_errmgr_stack_state_t *stack_state); static int update_state(orte_jobid_t job, @@ -408,9 +408,10 @@ static int update_state(orte_jobid_t job, jobdat->num_local_procs--; OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output, - "%s errmgr:orted reporting proc %s aborted to HNP", + "%s errmgr:orted 
reporting proc %s aborted to HNP (local procs = %d)", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), - ORTE_NAME_PRINT(child->name))); + ORTE_NAME_PRINT(child->name), + jobdat->num_local_procs)); /* release the child object */ OBJ_RELEASE(child); @@ -578,9 +579,9 @@ static int update_state(orte_jobid_t job, return ORTE_SUCCESS; } -static int predicted_fault(char ***proc_list, - char ***node_list, - char ***suggested_nodes, +static int predicted_fault(opal_list_t *proc_list, + opal_list_t *node_list, + opal_list_t *suggested_map, orte_errmgr_stack_state_t *stack_state) { return ORTE_ERR_NOT_IMPLEMENTED; diff --git a/orte/mca/ess/env/ess_env_module.c b/orte/mca/ess/env/ess_env_module.c index 53953bf64b..a507eb9585 100644 --- a/orte/mca/ess/env/ess_env_module.c +++ b/orte/mca/ess/env/ess_env_module.c @@ -478,6 +478,26 @@ static int rte_ft_event(int state) exit_status = ret; goto cleanup; } + + if( orte_cr_continue_like_restart ) { + /* + * Barrier to make all processes have been successfully restarted before + * we try to remove some restart only files. 
+ */ + if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier())) { + opal_output(0, "ess:env: ft_event(%2d): Failed in orte_grpcomm.barrier (%d)", + state, ret); + return ret; + } + + if( orte_cr_flush_restart_files ) { + OPAL_OUTPUT_VERBOSE((1, orte_ess_base_output, + "ess:env ft_event(%2d): %s " + "Cleanup restart files...", + state, ORTE_NAME_PRINT(ORTE_PROC_MY_NAME))); + opal_crs_base_cleanup_flush(); + } + } } /******** Restart Recovery ********/ else if (OPAL_CRS_RESTART == state ) { @@ -567,22 +587,6 @@ static int rte_ft_event(int state) goto cleanup; } - /* - * Session directory re-init - */ - if (orte_create_session_dirs) { - if (ORTE_SUCCESS != (ret = orte_session_dir(true, - orte_process_info.tmpdir_base, - orte_process_info.nodename, - NULL, /* Batch ID -- Not used */ - ORTE_PROC_MY_NAME))) { - exit_status = ret; - } - - opal_output_set_output_file_info(orte_process_info.proc_session_dir, - "output-", NULL, NULL); - } - /* * Notify Routed */ @@ -599,6 +603,40 @@ static int rte_ft_event(int state) goto cleanup; } + /* + * Barrier to make sure all processes have been successfully restarted before + * we try to remove any restart-only files.
+ */ + if (ORTE_SUCCESS != (ret = orte_grpcomm.barrier())) { + opal_output(0, "ess:env ft_event(%2d): Failed in orte_grpcomm.barrier (%d)", + state, ret); + return ret; + } + if( orte_cr_flush_restart_files ) { + OPAL_OUTPUT_VERBOSE((1, orte_ess_base_output, + "ess:env ft_event(%2d): %s " + "Cleanup restart files...", + state, ORTE_NAME_PRINT(ORTE_PROC_MY_NAME))); + + opal_crs_base_cleanup_flush(); + } + + /* + * Session directory re-init + */ + if (orte_create_session_dirs) { + if (ORTE_SUCCESS != (ret = orte_session_dir(true, + orte_process_info.tmpdir_base, + orte_process_info.nodename, + NULL, /* Batch ID -- Not used */ + ORTE_PROC_MY_NAME))) { + exit_status = ret; + } + + opal_output_set_output_file_info(orte_process_info.proc_session_dir, + "output-", NULL, NULL); + } + /* * Notify SnapC */ @@ -607,7 +645,6 @@ static int rte_ft_event(int state) exit_status = ret; goto cleanup; } - } else if (OPAL_CRS_TERM == state ) { /* Nothing */ diff --git a/orte/mca/filem/rsh/filem_rsh.h b/orte/mca/filem/rsh/filem_rsh.h index 3c62b5b962..e774803159 100644 --- a/orte/mca/filem/rsh/filem_rsh.h +++ b/orte/mca/filem/rsh/filem_rsh.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. @@ -56,6 +56,7 @@ BEGIN_C_DECLS extern int orte_filem_rsh_max_incomming; extern int orte_filem_rsh_max_outgoing; + extern int orte_filem_rsh_progress_meter; int orte_filem_rsh_component_query(mca_base_module_t **module, int *priority); diff --git a/orte/mca/filem/rsh/filem_rsh_component.c b/orte/mca/filem/rsh/filem_rsh_component.c index c738cdcebb..c91f63c6b2 100644 --- a/orte/mca/filem/rsh/filem_rsh_component.c +++ b/orte/mca/filem/rsh/filem_rsh_component.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. 
+ * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. @@ -38,6 +38,7 @@ static int filem_rsh_close(void); int orte_filem_rsh_max_incomming = 10; int orte_filem_rsh_max_outgoing = 10; +int orte_filem_rsh_progress_meter = 0; /* * Instantiate the public struct with all of our public information @@ -152,6 +153,14 @@ static int filem_rsh_open(void) orte_filem_rsh_max_outgoing = 1; } + mca_base_param_reg_int(&mca_filem_rsh_component.super.base_version, + "progress_meter", + "Display Progress every X percentage done. [Default = 0/off]", + false, false, + 0, + &orte_filem_rsh_progress_meter); + orte_filem_rsh_progress_meter = (orte_filem_rsh_progress_meter % 101); + /* * Debug Output */ diff --git a/orte/mca/filem/rsh/filem_rsh_module.c b/orte/mca/filem/rsh/filem_rsh_module.c index fd777697b2..bee5377eb6 100644 --- a/orte/mca/filem/rsh/filem_rsh_module.c +++ b/orte/mca/filem/rsh/filem_rsh_module.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. 
@@ -82,7 +82,7 @@ static int orte_filem_rsh_query_remote_path(char **remote_ref, static void filem_rsh_waitpid_cb(pid_t pid, int status, void* cbdata); /* Permission to send functionality */ -static int orte_filem_rsh_permission_listener_init(orte_rml_buffer_callback_fn_t rml_cbfunc); +static int orte_filem_rsh_permission_listener_init(void); static int orte_filem_rsh_permission_listener_cancel(void); static void orte_filem_rsh_permission_callback(int status, orte_process_name_t* sender, @@ -252,8 +252,8 @@ int orte_filem_rsh_module_init(void) /* * Start the listener for permission - */ - if( ORTE_SUCCESS != (ret = orte_filem_rsh_permission_listener_init(orte_filem_rsh_permission_callback) ) ) { + */ + if( ORTE_SUCCESS != (ret = orte_filem_rsh_permission_listener_init())) { opal_output(mca_filem_rsh_component.super.output_handle, "filem:rsh:init Failed to start listener\n"); return ret; @@ -645,6 +645,11 @@ int orte_filem_rsh_wait_all(opal_list_t * request_list) { int ret = ORTE_SUCCESS, exit_status = ORTE_SUCCESS; opal_list_item_t *item = NULL; + double perc_done, last_reported = 0.0; + int total, done; + + total = opal_list_get_size(request_list); + done = 0; for (item = opal_list_get_first( request_list); item != opal_list_get_end( request_list); @@ -657,6 +662,18 @@ int orte_filem_rsh_wait_all(opal_list_t * request_list) exit_status = ret; goto cleanup; } + + /* Progress Meter */ + if( OPAL_UNLIKELY(orte_filem_rsh_progress_meter > 0) ) { + ++done; + perc_done = (total - done) / (1.0 * total); + perc_done = (perc_done-1)*(-100.0); + if( perc_done >= (last_reported + orte_filem_rsh_progress_meter) || last_reported == 0.0 ) { + last_reported = perc_done; + opal_output(0, "filem:rsh: progress: %10.2f %c Finished\n", + perc_done, '%'); + } + } } cleanup: @@ -859,17 +876,6 @@ static int orte_filem_rsh_start_copy(orte_filem_base_request_t *request) { remote_machine, remote_file); } - OPAL_OUTPUT_VERBOSE((17, mca_filem_rsh_component.super.output_handle, - 
"filem:rsh:put about to execute [%s]", command)); - - if( ORTE_SUCCESS != (ret = orte_filem_rsh_start_command(p_set, - f_set, - command, - request, - cur_index)) ) { - exit_status = ret; - goto cleanup; - } } /* * ow it is the get() routine @@ -892,18 +898,22 @@ static int orte_filem_rsh_start_copy(orte_filem_base_request_t *request) { remote_file, f_set->local_target); } + } - OPAL_OUTPUT_VERBOSE((17, mca_filem_rsh_component.super.output_handle, - "filem:rsh:get about to execute [%s]", command)); - - if( ORTE_SUCCESS != (ret = orte_filem_rsh_start_command(p_set, - f_set, - command, - request, - cur_index)) ) { - exit_status = ret; - goto cleanup; - } + /* + * Start the command + */ + OPAL_OUTPUT_VERBOSE((17, mca_filem_rsh_component.super.output_handle, + "filem:rsh:%s about to execute [%s]", + (request->movement_type == ORTE_FILEM_MOVE_TYPE_PUT ? "put" : "get"), + command)); + if( ORTE_SUCCESS != (ret = orte_filem_rsh_start_command(p_set, + f_set, + command, + request, + cur_index)) ) { + exit_status = ret; + goto cleanup; } continue_set: @@ -1121,6 +1131,8 @@ static int orte_filem_rsh_start_command(orte_filem_base_process_set_t *proc_set /* * Ask for permission to send this file so we do not overwhelm the peer + * Allow only one file request at a time. + * JJH: Look into permission for multiple file permissions at a time */ OPAL_OUTPUT_VERBOSE((10, mca_filem_rsh_component.super.output_handle, "filem:rsh: start_command(): Ask permission to send from proc %s (%d of %d)", @@ -1263,16 +1275,18 @@ static void filem_rsh_waitpid_cb(pid_t pid, int status, void* cbdata) static int orte_filem_rsh_query_remote_path(char **remote_ref, orte_process_name_t *peer, int *flag) { int ret; -#if 0 - /* An optimization if we are guarenteed that this remote files exists. - * Then the 'scp -r' option will work with both files and directories. - * JJH: For general correctness disable this piece of code. 
+ /* + * If we are given an absolute path for the remote side, then there is + * nothing to do. If the remote directory does not exist, then scp will + * error out, which is caught by the filem_rsh_waitpid_cb() function. + * + * Assume the remote path is a directory, since if it is just a file then + * the command will still work as normal. */ if( *remote_ref[0] == '/' ) { *flag = ORTE_FILEM_TYPE_DIR; return ORTE_SUCCESS; } -#endif /* Call the base function */ if( ORTE_SUCCESS != (ret = orte_filem_base_get_remote_path(remote_ref, peer, flag) ) ) { @@ -1285,14 +1299,14 @@ static int orte_filem_rsh_query_remote_path(char **remote_ref, orte_process_name /****************************** * Permission functions ******************************/ -static int orte_filem_rsh_permission_listener_init(orte_rml_buffer_callback_fn_t rml_cbfunc) +static int orte_filem_rsh_permission_listener_init(void) { int ret; if( ORTE_SUCCESS != (ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, ORTE_RML_TAG_FILEM_RSH, ORTE_RML_PERSISTENT, - rml_cbfunc, + orte_filem_rsh_permission_callback, NULL)) ) { opal_output(mca_filem_rsh_component.super.output_handle, "filem:rsh: listener_init: Failed to register the receive callback (%d)", @@ -1383,6 +1397,11 @@ static void orte_filem_rsh_permission_callback(int status, } /* Start the transfer immediately */ else { + /* + * Allow only one file request at a time. + * orte_filem_rsh_start_command() only asks for one anyway. 
+ * JJH: Look into permission for multiple file permissions at a time + */ num_allowed = 1; cur_num_incomming += 1; @@ -1403,7 +1422,7 @@ static void orte_filem_rsh_permission_callback(int status, * Receive the allowed transmit amount */ n = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &num_req, &n, OPAL_INT))) { + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &num_allowed, &n, OPAL_INT))) { goto cleanup; } @@ -1412,7 +1431,7 @@ static void orte_filem_rsh_permission_callback(int status, * - Get a pending request directed at this peer * - Start the pending request */ - for(i = 0; i < num_req; ++i ) { + for(i = 0; i < num_allowed; ++i ) { if( 0 >= opal_list_get_size(&work_pool_pending) ) { OPAL_OUTPUT_VERBOSE((10, mca_filem_rsh_component.super.output_handle, "filem:rsh: permission_callback(ALLOW): No more pending sends to peer %s...", @@ -1641,4 +1660,3 @@ static int permission_send_num_allowed(orte_process_name_t* peer, int num_allowe return exit_status; } - diff --git a/orte/mca/odls/base/odls_base_default_fns.c b/orte/mca/odls/base/odls_base_default_fns.c index bab7372641..627db7a117 100644 --- a/orte/mca/odls/base/odls_base_default_fns.c +++ b/orte/mca/odls/base/odls_base_default_fns.c @@ -72,6 +72,8 @@ #if OPAL_ENABLE_FT_CR == 1 #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" #include "opal/mca/crs/crs.h" #include "opal/mca/crs/base/base.h" #endif @@ -1489,7 +1491,14 @@ int orte_odls_base_default_launch_local(orte_jobid_t job, } } } - + +#if OPAL_ENABLE_FT_CR == 1 + for (i=0; i < num_apps; i++) { + orte_sstore.fetch_app_deps(apps[i]); + } + orte_sstore.wait_all_deps(); +#endif + if (ORTE_SUCCESS != (rc = opal_paffinity_base_get_processor_info(&num_processors))) { /* if we cannot find the number of local processors, we have no choice * but to default to conservative settings @@ -1821,7 +1830,7 @@ int orte_odls_base_default_launch_local(orte_jobid_t 
job, */ if( NULL != opal_crs.crs_prelaunch ) { if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid, - orte_snapc_base_global_snapshot_loc, + orte_sstore_base_prelaunch_location, &(app->app), &(app->cwd), &(app->argv), @@ -1831,7 +1840,6 @@ int orte_odls_base_default_launch_local(orte_jobid_t job, } } #endif - if (5 < opal_output_get_verbosity(orte_odls_globals.output)) { opal_output(orte_odls_globals.output, "%s odls:launch: spawning child %s", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), @@ -2157,6 +2165,11 @@ int orte_odls_base_default_require_sync(orte_process_name_t *proc, int8_t flag; orte_odls_job_t *jobdat, *jdat; + OPAL_OUTPUT_VERBOSE((5, orte_odls_globals.output, + "%s odls: require sync on child %s", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + ORTE_NAME_PRINT(proc))); + /* protect operations involving the global list of children */ OPAL_THREAD_LOCK(&orte_odls_globals.mutex); @@ -2207,10 +2220,18 @@ int orte_odls_base_default_require_sync(orte_process_name_t *proc, free(child->rml_uri); child->rml_uri = NULL; child->fini_recvd = true; + OPAL_OUTPUT_VERBOSE((5, orte_odls_globals.output, + "%s odls: require sync deregistering child %s", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + ORTE_NAME_PRINT(child->name))); } else { /* if the contact info is not set, then we are registering the child so * unpack the contact info from the buffer and store it */ + OPAL_OUTPUT_VERBOSE((5, orte_odls_globals.output, + "%s odls: require sync registering child %s", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + ORTE_NAME_PRINT(child->name))); child->init_recvd = true; registering = true; cnt = 1; @@ -2508,7 +2529,7 @@ void orte_base_default_waitpid_fired(orte_process_name_t *proc, int32_t status) if(WIFEXITED(status)) { /* set the exit status appropriately */ child->exit_code = WEXITSTATUS(status); - + if (ORTE_PROC_STATE_CALLED_ABORT == child->state) { /* even though the process exited "normally", it happened * via an orte_abort call, so we need to indicate this was @@ -2587,11 
+2608,14 @@ void orte_base_default_waitpid_fired(orte_process_name_t *proc, int32_t status) * same way */ child->exit_code = WTERMSIG(status) + 128; - + OPAL_OUTPUT_VERBOSE((5, orte_odls_globals.output, "%s odls:waitpid_fired child process %s terminated with signal", ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), - ORTE_NAME_PRINT(child->name))); + ORTE_NAME_PRINT(child->name) )); + /* JJH: Should we decrement the number of local procs on this node here? + * jobdat->num_local_procs--; + */ } MOVEON: @@ -2822,7 +2846,7 @@ int orte_odls_base_default_kill_local_procs(opal_pointer_array_t *procs, child->waitpid_recvd = true; goto CLEANUP; } - + /* mark the child as "killed" since the waitpid will * fire as soon as we kill it */ diff --git a/orte/mca/plm/base/plm_base_launch_support.c b/orte/mca/plm/base/plm_base_launch_support.c index 99f0488d6b..a71fe04777 100644 --- a/orte/mca/plm/base/plm_base_launch_support.c +++ b/orte/mca/plm/base/plm_base_launch_support.c @@ -248,15 +248,12 @@ int orte_plm_base_setup_job(orte_job_t *jdata) ***/ #if OPAL_ENABLE_FT_CR == 1 - /* JJH: Would it be useful to let the errmgr know what we are doing here? 
*/ /* - * Notify the Global SnapC component regarding new job + * Notify the Global SnapC component regarding new job (even if it was restarted) */ - if (ORTE_JOB_STATE_RESTART != jdata->state) { - if( ORTE_SUCCESS != (rc = orte_snapc.setup_job(jdata->jobid) ) ) { - /* Silent Failure :/ JJH */ - ORTE_ERROR_LOG(rc); - } + if( ORTE_SUCCESS != (rc = orte_snapc.setup_job(jdata->jobid) ) ) { + /* Silent Failure :/ JJH */ + ORTE_ERROR_LOG(rc); } #endif diff --git a/orte/mca/plm/plm_types.h b/orte/mca/plm/plm_types.h index 88c5776ddf..38a5d71c73 100644 --- a/orte/mca/plm/plm_types.h +++ b/orte/mca/plm/plm_types.h @@ -63,6 +63,8 @@ typedef uint32_t orte_proc_state_t; #define ORTE_PROC_STATE_SENSOR_BOUND_EXCEEDED 0x00004000 /* process exceeded a sensor limit */ #define ORTE_PROC_STATE_CALLED_ABORT 0x00008000 /* process called "errmgr.abort" */ #define ORTE_PROC_STATE_HEARTBEAT_FAILED 0x00010000 /* heartbeat failed to arrive */ +#define ORTE_PROC_STATE_MIGRATING 0x00020000 /* process is migrating */ + /* * Job state codes */ diff --git a/orte/mca/rml/rml_types.h b/orte/mca/rml/rml_types.h index 7bcb7784e0..c9e512824e 100644 --- a/orte/mca/rml/rml_types.h +++ b/orte/mca/rml/rml_types.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. 
* Copyright (c) 2004-2005 The University of Tennessee and The University @@ -183,6 +183,13 @@ ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_msg_packet_t); /* notifier data */ #define ORTE_RML_TAG_NOTIFIER_HNP 40 +/* Process Migration Tool Tag */ +#define ORTE_RML_TAG_MIGRATE 43 + +/* For SStore Framework */ +#define ORTE_RML_TAG_SSTORE 44 +#define ORTE_RML_TAG_SSTORE_INTERNAL 45 + #define ORTE_RML_TAG_MAX 100 diff --git a/orte/mca/snapc/base/base.h b/orte/mca/snapc/base/base.h index c490937e69..3ec98e0111 100644 --- a/orte/mca/snapc/base/base.h +++ b/orte/mca/snapc/base/base.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. * Copyright (c) 2004-2005 The University of Tennessee and The University @@ -53,6 +53,7 @@ typedef uint8_t orte_snapc_cmd_flag_t; #define ORTE_SNAPC_GLOBAL_TERM_CMD 2 #define ORTE_SNAPC_GLOBAL_UPDATE_CMD 3 #define ORTE_SNAPC_LOCAL_UPDATE_CMD 4 +#define ORTE_SNAPC_LOCAL_FINISH_CMD 5 /** * There are 3 types of Coordinators, and any process may be once or more type. @@ -104,6 +105,9 @@ ORTE_DECLSPEC extern orte_snapc_coord_type_t orte_snapc_coord_type; void orte_snapc_base_quiesce_construct(orte_snapc_base_quiesce_t *obj); void orte_snapc_base_quiesce_destruct( orte_snapc_base_quiesce_t *obj); + void orte_snapc_base_request_op_construct(orte_snapc_base_request_op_t *op); + void orte_snapc_base_request_op_destruct(orte_snapc_base_request_op_t *op); + /** * 'None' component functions * These are to be used when no component is selected. 
@@ -129,40 +133,15 @@ ORTE_DECLSPEC extern orte_snapc_coord_type_t orte_snapc_coord_type; /** * Globals */ -#define orte_snapc_base_metadata_filename (strdup("global_snapshot_meta.data")) - - ORTE_DECLSPEC extern char * orte_snapc_base_global_snapshot_dir; - ORTE_DECLSPEC extern char * orte_snapc_base_global_snapshot_ref; - ORTE_DECLSPEC extern char * orte_snapc_base_global_snapshot_loc; - ORTE_DECLSPEC extern bool orte_snapc_base_store_in_place; ORTE_DECLSPEC extern bool orte_snapc_base_store_only_one_seq; - ORTE_DECLSPEC extern bool orte_snapc_base_establish_global_snapshot_dir; - ORTE_DECLSPEC extern bool orte_snapc_base_is_global_dir_shared; ORTE_DECLSPEC extern size_t orte_snapc_base_snapshot_seq_number; - + ORTE_DECLSPEC extern bool orte_snapc_base_has_recovered; /** * Some utility functions */ ORTE_DECLSPEC int orte_snapc_ckpt_state_str(char ** state_str, int state); - ORTE_DECLSPEC int orte_snapc_base_unique_global_snapshot_name(char **name_str, pid_t pid); - ORTE_DECLSPEC int orte_snapc_base_get_global_snapshot_metadata_file(char **file_name, char *uniq_snapshot_name); - ORTE_DECLSPEC int orte_snapc_base_get_global_snapshot_directory(char **dir_name, char *uniq_global_snapshot_name); - ORTE_DECLSPEC int orte_snapc_base_init_global_snapshot_directory(char *uniq_global_snapshot_name, - bool empty_metadata); - ORTE_DECLSPEC int orte_snapc_base_add_timestamp(char * global_snapshot_ref); - ORTE_DECLSPEC int orte_snapc_base_add_vpid_metadata(orte_process_name_t *proc, - char * global_snapshot_ref, - char *snapshot_ref, - char *snapshot_location, - char *crs_agent); - ORTE_DECLSPEC int orte_snapc_base_finalize_metadata(char * global_snapshot_ref); - ORTE_DECLSPEC int orte_snapc_base_extract_metadata(orte_snapc_base_global_snapshot_t *snapshot); - - ORTE_DECLSPEC int orte_snapc_base_get_all_snapshot_refs(char *base_dir, int *num_refs, char ***snapshot_refs); - ORTE_DECLSPEC int orte_snapc_base_get_all_snapshot_ref_seqs(char *base_dir, char *snapshot_name, int 
*num_seqs, int **snapshot_ref_seqs); - /******************************* * Global Coordinator functions *******************************/ @@ -172,8 +151,7 @@ ORTE_DECLSPEC extern orte_snapc_coord_type_t orte_snapc_coord_type; opal_crs_base_ckpt_options_t *options, orte_jobid_t *jobid); ORTE_DECLSPEC int orte_snapc_base_global_coord_ckpt_update_cmd(orte_process_name_t* peer, - char *global_snapshot_handle, - int seq_num, + orte_sstore_base_handle_t handle, int ckpt_status); ORTE_DECLSPEC int orte_snapc_base_unpack_options(opal_buffer_t* buffer, diff --git a/orte/mca/snapc/base/snapc_base_close.c b/orte/mca/snapc/base/snapc_base_close.c index 9ae1ec3e17..d18622e3b4 100644 --- a/orte/mca/snapc/base/snapc_base_close.c +++ b/orte/mca/snapc/base/snapc_base_close.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2007 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. 
@@ -21,6 +21,8 @@ #include "opal/mca/base/base.h" #include "opal/mca/base/mca_base_param.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" @@ -28,6 +30,11 @@ int orte_snapc_base_close(void) { + /* + * Close the SStore framework + */ + orte_sstore_base_close(); + /* Close the selected component */ if( NULL != orte_snapc.snapc_finalize ) { orte_snapc.snapc_finalize(); diff --git a/orte/mca/snapc/base/snapc_base_fns.c b/orte/mca/snapc/base/snapc_base_fns.c index 583d49baf8..0e5c945bb3 100644 --- a/orte/mca/snapc/base/snapc_base_fns.c +++ b/orte/mca/snapc/base/snapc_base_fns.c @@ -57,25 +57,15 @@ #include "orte/runtime/orte_globals.h" #include "orte/util/name_fns.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" /****************** * Local Functions ******************/ -/* Some local strings to use genericly with the global metadata file */ -#define SNAPC_METADATA_SEQ ("# Seq: ") -#define SNAPC_METADATA_DONE_SEQ ("# Finished Seq: ") -#define SNAPC_METADATA_TIME ("# Timestamp: ") -#define SNAPC_METADATA_PROCESS ("# Process: ") -#define SNAPC_METADATA_CRS_COMP ("# OPAL CRS Component: ") -#define SNAPC_METADATA_SNAP_REF ("# Snapshot Reference: ") -#define SNAPC_METADATA_SNAP_LOC ("# Snapshot Location: ") - -static int get_next_seq_number(FILE *file); -static int get_next_valid_seq_number(FILE *file); -static int metadata_extract_next_token(FILE *file, char **token, char **value); - size_t orte_snapc_base_snapshot_seq_number = 0; /****************** @@ -93,11 +83,7 @@ void orte_snapc_base_local_snapshot_construct(orte_snapc_base_local_snapshot_t * snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE; - snapshot->reference_name = NULL; - snapshot->local_location = NULL; - snapshot->remote_location = NULL; - - snapshot->opal_crs = NULL; + snapshot->ss_handle =
ORTE_SSTORE_HANDLE_INVALID; } void orte_snapc_base_local_snapshot_destruct( orte_snapc_base_local_snapshot_t *snapshot) @@ -107,25 +93,7 @@ void orte_snapc_base_local_snapshot_destruct( orte_snapc_base_local_snapshot_t * snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE; - if( NULL != snapshot->reference_name ) { - free(snapshot->reference_name); - snapshot->reference_name = NULL; - } - - if( NULL != snapshot->local_location ) { - free(snapshot->local_location); - snapshot->local_location = NULL; - } - - if( NULL != snapshot->remote_location ) { - free(snapshot->remote_location); - snapshot->remote_location = NULL; - } - - if( NULL != snapshot->opal_crs ) { - free(snapshot->opal_crs); - snapshot->opal_crs = NULL; - } + snapshot->ss_handle = ORTE_SSTORE_HANDLE_INVALID; } /****/ @@ -136,20 +104,11 @@ OBJ_CLASS_INSTANCE(orte_snapc_base_global_snapshot_t, void orte_snapc_base_global_snapshot_construct(orte_snapc_base_global_snapshot_t *snapshot) { - char *tmp_dir = NULL; - OBJ_CONSTRUCT(&(snapshot->local_snapshots), opal_list_t); - orte_snapc_base_unique_global_snapshot_name(&(snapshot->reference_name), getpid()); + snapshot->options = OBJ_NEW(opal_crs_base_ckpt_options_t); - orte_snapc_base_get_global_snapshot_directory(&tmp_dir, snapshot->reference_name); - snapshot->local_location = opal_dirname(tmp_dir); - free(tmp_dir); - - snapshot->seq_num = 0; - - snapshot->start_time = NULL; - snapshot->end_time = NULL; + snapshot->ss_handle = ORTE_SSTORE_HANDLE_INVALID; } void orte_snapc_base_global_snapshot_destruct( orte_snapc_base_global_snapshot_t *snapshot) @@ -161,27 +120,12 @@ void orte_snapc_base_global_snapshot_destruct( orte_snapc_base_global_snapshot_t } OBJ_DESTRUCT(&(snapshot->local_snapshots)); - if(NULL != snapshot->reference_name) { - free(snapshot->reference_name); - snapshot->reference_name = NULL; + if( NULL != snapshot->options ) { + OBJ_RELEASE(snapshot->options); + snapshot->options = NULL; } - if(NULL != snapshot->local_location) { - 
free(snapshot->local_location); - snapshot->local_location = NULL; - } - - if(NULL != snapshot->start_time) { - free(snapshot->start_time); - snapshot->start_time = NULL; - } - - if(NULL != snapshot->end_time) { - free(snapshot->end_time); - snapshot->end_time = NULL; - } - - snapshot->seq_num = 0; + snapshot->ss_handle = ORTE_SSTORE_HANDLE_INVALID; } OBJ_CLASS_INSTANCE(orte_snapc_base_quiesce_t, @@ -193,6 +137,8 @@ void orte_snapc_base_quiesce_construct(orte_snapc_base_quiesce_t *quiesce) { quiesce->epoch = -1; quiesce->snapshot = NULL; + quiesce->ss_handle = ORTE_SSTORE_HANDLE_INVALID; + quiesce->ss_snapshot = NULL; quiesce->handle = NULL; quiesce->target_dir = NULL; quiesce->crs_name = NULL; @@ -201,10 +147,17 @@ void orte_snapc_base_quiesce_construct(orte_snapc_base_quiesce_t *quiesce) quiesce->checkpointing = false; quiesce->restarting = false; + quiesce->migrating = false; + quiesce->num_migrating = 0; + OBJ_CONSTRUCT(&(quiesce->migrating_procs), opal_pointer_array_t); + opal_pointer_array_init(&(quiesce->migrating_procs), 8, INT32_MAX, 8); } void orte_snapc_base_quiesce_destruct( orte_snapc_base_quiesce_t *quiesce) { + int i; + void *item = NULL; + quiesce->epoch = -1; if( NULL != quiesce->snapshot ) { @@ -212,6 +165,12 @@ void orte_snapc_base_quiesce_destruct( orte_snapc_base_quiesce_t *quiesce) quiesce->snapshot = NULL; } + quiesce->ss_handle = ORTE_SSTORE_HANDLE_INVALID; + if( NULL != quiesce->ss_snapshot ) { + OBJ_RELEASE(quiesce->ss_snapshot); + quiesce->ss_snapshot = NULL; + } + if( NULL != quiesce->handle ) { free(quiesce->handle); quiesce->handle = NULL; @@ -233,6 +192,75 @@ void orte_snapc_base_quiesce_destruct( orte_snapc_base_quiesce_t *quiesce) quiesce->checkpointing = false; quiesce->restarting = false; + quiesce->migrating = false; + quiesce->num_migrating = 0; + for( i = 0; i < quiesce->migrating_procs.size; ++i) { + item = opal_pointer_array_get_item(&(quiesce->migrating_procs), i); + if( NULL != item ) { + OBJ_RELEASE(item); + } + } + 
OBJ_DESTRUCT(&(quiesce->migrating_procs)); +} + +OBJ_CLASS_INSTANCE(orte_snapc_base_request_op_t, + opal_object_t, + orte_snapc_base_request_op_construct, + orte_snapc_base_request_op_destruct); + +void orte_snapc_base_request_op_construct(orte_snapc_base_request_op_t *op) +{ + op->event = ORTE_SNAPC_OP_NONE; + op->is_active = false; + op->leader = -1; + + op->seq_num = -1; + op->global_handle = NULL; + op->ss_handle = ORTE_SSTORE_HANDLE_INVALID; + + op->mig_num = -1; + op->mig_vpids = NULL; + /*op->mig_host_pref = NULL;*/ + op->mig_vpid_pref = NULL; + op->mig_off_node = NULL; +} + +void orte_snapc_base_request_op_destruct( orte_snapc_base_request_op_t *op) +{ + op->event = ORTE_SNAPC_OP_NONE; + op->is_active = false; + op->leader = -1; + + op->seq_num = -1; + if(NULL != op->global_handle ) { + free(op->global_handle); + op->global_handle = NULL; + } + + op->ss_handle = ORTE_SSTORE_HANDLE_INVALID; + + op->mig_num = -1; + /* + if( NULL != op->mig_vpids ) { + free( op->mig_vpids ); + op->mig_vpids = NULL; + } + + if( NULL != op->mig_host_pref ) { + free( op->mig_host_pref ); + op->mig_host_pref = NULL; + } + + if( NULL != op->mig_vpid_pref ) { + free( op->mig_vpid_pref ); + op->mig_vpid_pref = NULL; + } + + if( NULL != op->mig_off_node ) { + free( op->mig_off_node ); + op->mig_off_node = NULL; + } + */ } @@ -363,7 +391,7 @@ static void snapc_none_global_cmdline_request(int status, /* * Respond with an invalid response */ - if( ORTE_SUCCESS != (ret = orte_snapc_base_global_coord_ckpt_update_cmd(sender, NULL, -1, ORTE_SNAPC_CKPT_STATE_NO_CKPT)) ) { + if( ORTE_SUCCESS != (ret = orte_snapc_base_global_coord_ckpt_update_cmd(sender, 0, ORTE_SNAPC_CKPT_STATE_NO_CKPT)) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; @@ -473,7 +501,49 @@ int orte_snapc_base_unpack_options(opal_buffer_t* buffer, exit_status = ret; goto cleanup; } - + + count = 1; + if ( ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(options->inc_prep_only), &count, OPAL_BOOL)) ) { + 
opal_output(orte_snapc_base_output, + "snapc:base:unpack_options: Error: Unpack (inc_prep_only) Failure (ret = %d)\n", + ret); + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if ( ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(options->inc_recover_only), &count, OPAL_BOOL)) ) { + opal_output(orte_snapc_base_output, + "snapc:base:unpack_options: Error: Unpack (inc_recover_only) Failure (ret = %d)\n", + ret); + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + +#if OPAL_ENABLE_CRDEBUG == 1 + count = 1; + if ( ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(options->attach_debugger), &count, OPAL_BOOL)) ) { + opal_output(orte_snapc_base_output, + "snapc:base:unpack_options: Error: Unpack (attach_debugger) Failure (ret = %d)\n", + ret); + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if ( ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(options->detach_debugger), &count, OPAL_BOOL)) ) { + opal_output(orte_snapc_base_output, + "snapc:base:unpack_options: Error: Unpack (detach_debugger) Failure (ret = %d)\n", + ret); + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } +#endif + cleanup: return exit_status; } @@ -495,18 +565,46 @@ int orte_snapc_base_pack_options(opal_buffer_t* buffer, goto cleanup; } + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(options->inc_prep_only), 1, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(options->inc_recover_only), 1, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + +#if OPAL_ENABLE_CRDEBUG == 1 + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(options->attach_debugger), 1, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(options->detach_debugger), 1, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } +#endif 
+ cleanup: return exit_status; } int orte_snapc_base_global_coord_ckpt_update_cmd(orte_process_name_t* peer, - char *global_snapshot_handle, - int seq_num, + orte_sstore_base_handle_t ss_handle, int ckpt_status) { int ret, exit_status = ORTE_SUCCESS; opal_buffer_t *loc_buffer = NULL; orte_snapc_cmd_flag_t command = ORTE_SNAPC_GLOBAL_UPDATE_CMD; + char *global_snapshot_handle = NULL; + char *tmp_str = NULL; + int seq_num; /* * Noop if invalid peer, or peer not specified (JJH Double check this) @@ -529,9 +627,9 @@ int orte_snapc_base_global_coord_ckpt_update_cmd(orte_process_name_t* peer, } OPAL_OUTPUT_VERBOSE((10, orte_snapc_base_output, - "%s) base:ckpt_update_cmd: Sending update command <%s> \n", + "%s) base:ckpt_update_cmd: Sending update command \n", ORTE_SNAPC_COORD_NAME_PRINT(orte_snapc_coord_type), - global_snapshot_handle, seq_num, ckpt_status)); + ckpt_status)); /******************** * Send over the status of the checkpoint @@ -560,9 +658,24 @@ int orte_snapc_base_global_coord_ckpt_update_cmd(orte_process_name_t* peer, goto cleanup; } - if( ORTE_SNAPC_CKPT_STATE_FINISHED == ckpt_status || - ORTE_SNAPC_CKPT_STATE_STOPPED == ckpt_status || - ORTE_SNAPC_CKPT_STATE_ERROR == ckpt_status ) { + if( ORTE_SNAPC_CKPT_STATE_RECOVERED == ckpt_status || + ORTE_SNAPC_CKPT_STATE_ESTABLISHED == ckpt_status || + ORTE_SNAPC_CKPT_STATE_STOPPED == ckpt_status || + ORTE_SNAPC_CKPT_STATE_ERROR == ckpt_status ) { + orte_sstore.get_attr(ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_REF, + &global_snapshot_handle); + + orte_sstore.get_attr(ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_SEQ, + &tmp_str); + seq_num = atoi(tmp_str); + + OPAL_OUTPUT_VERBOSE((10, orte_snapc_base_output, + "%s) base:ckpt_update_cmd: Sending update command + \n", + ORTE_SNAPC_COORD_NAME_PRINT(orte_snapc_coord_type), + ckpt_status, global_snapshot_handle, seq_num)); + if (ORTE_SUCCESS != (ret = opal_dss.pack(loc_buffer, &global_snapshot_handle, 1, OPAL_STRING))) { opal_output(orte_snapc_base_output, "%s) 
base:ckpt_update_cmd: Error: DSS Pack (snapshot handle) Failure (ret = %d) (LINE = %d)\n", @@ -572,6 +685,7 @@ int orte_snapc_base_global_coord_ckpt_update_cmd(orte_process_name_t* peer, exit_status = ret; goto cleanup; } + if (ORTE_SUCCESS != (ret = opal_dss.pack(loc_buffer, &seq_num, 1, OPAL_INT))) { opal_output(orte_snapc_base_output, "%s) base:ckpt_update_cmd: Error: DSS Pack (seq number) Failure (ret = %d) (LINE = %d)\n", @@ -593,12 +707,19 @@ int orte_snapc_base_global_coord_ckpt_update_cmd(orte_process_name_t* peer, goto cleanup; } - cleanup: if(NULL != loc_buffer) { OBJ_RELEASE(loc_buffer); loc_buffer = NULL; } + if( NULL != global_snapshot_handle ){ + free(global_snapshot_handle); + global_snapshot_handle = NULL; + } + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } return exit_status; } @@ -611,682 +732,6 @@ int orte_snapc_base_global_coord_ckpt_update_cmd(orte_process_name_t* peer, /***************************** * Snapshot metadata functions *****************************/ -int orte_snapc_base_get_all_snapshot_refs(char *base_dir, int *num_refs, char ***snapshot_refs) -{ -#ifndef HAVE_DIRENT_H - return ORTE_ERR_NOT_SUPPORTED; -#else - int ret, exit_status = ORTE_SUCCESS; - char * tmp_str = NULL, * metadata_file = NULL; - DIR *dirp = NULL; - struct dirent *dir_entp = NULL; - struct stat file_status; - - if( NULL == base_dir ) { - if( NULL == orte_snapc_base_global_snapshot_dir ) { - exit_status = ORTE_ERROR; - goto cleanup; - } - base_dir = strdup(orte_snapc_base_global_snapshot_dir); - } - - /* - * Get all subdirectories under the base directory - */ - dirp = opendir(base_dir); - while( NULL != (dir_entp = readdir(dirp))) { - /* Skip "." and ".." 
if they are in the list */ - if( 0 == strncmp("..", dir_entp->d_name, strlen("..") ) || - 0 == strncmp(".", dir_entp->d_name, strlen(".") ) ) { - continue; - } - - /* Add the full path */ - asprintf(&tmp_str, "%s/%s", base_dir, dir_entp->d_name); - if(0 != (ret = stat(tmp_str, &file_status) ) ){ - free( tmp_str); - tmp_str = NULL; - continue; - } else { - /* Is it a directory? */ - if(S_ISDIR(file_status.st_mode) ) { - asprintf(&metadata_file, "%s/%s", - tmp_str, - orte_snapc_base_metadata_filename); - if(0 != (ret = stat(metadata_file, &file_status) ) ){ - free( tmp_str); - tmp_str = NULL; - free( metadata_file); - metadata_file = NULL; - continue; - } else { - if(S_ISREG(file_status.st_mode) ) { - opal_argv_append(num_refs, snapshot_refs, dir_entp->d_name); - } - } - free( metadata_file); - metadata_file = NULL; - } - } - - free( tmp_str); - tmp_str = NULL; - } - - closedir(dirp); - - cleanup: - if( NULL != tmp_str) { - free( tmp_str); - tmp_str = NULL; - } - - return exit_status; -#endif /* HAVE_DIRENT_H */ -} - -int orte_snapc_base_get_all_snapshot_ref_seqs(char *base_dir, char *snapshot_name, int *num_seqs, int **snapshot_ref_seqs) -{ - int exit_status = ORTE_SUCCESS; - char * metadata_file = NULL; - FILE * meta_data = NULL; - int s, next_seq_int; - - if( NULL == base_dir ) { - if( NULL == orte_snapc_base_global_snapshot_dir ) { - exit_status = ORTE_ERROR; - goto cleanup; - } - base_dir = strdup(orte_snapc_base_global_snapshot_dir); - } - - asprintf(&metadata_file, "%s/%s/%s", - base_dir, - snapshot_name, - orte_snapc_base_metadata_filename); - - - if (NULL == (meta_data = fopen(metadata_file, "r")) ) { - opal_output(0, "Error: Unable to open the file <%s>\n", metadata_file); - exit_status = ORTE_ERROR; - goto cleanup; - } - - /* First pass to count the number of sequence numbers */ - *num_seqs = 0; - while(0 <= (next_seq_int = get_next_valid_seq_number(meta_data)) ){ - *num_seqs += 1; - } - - /* If there are no valid seq numbers then just return here */ - if( 
0 == *num_seqs ) { - exit_status = ORTE_SUCCESS; - goto cleanup; - } - - rewind(meta_data); - - /* Second pass to add them to the list */ - (*snapshot_ref_seqs) = (int *) malloc(sizeof(int) * (*num_seqs)); - s = 0; - while(0 <= (next_seq_int = get_next_valid_seq_number(meta_data)) ){ - (*snapshot_ref_seqs)[s] = next_seq_int; - ++s; - } - - cleanup: - if(NULL != meta_data) { - fclose(meta_data); - meta_data = NULL; - } - if(NULL != metadata_file) { - free(metadata_file); - metadata_file = NULL; - } - - return exit_status; -} - -int orte_snapc_base_unique_global_snapshot_name(char **name_str, pid_t pid) -{ - if( NULL == orte_snapc_base_global_snapshot_ref ) { - asprintf(name_str, "ompi_global_snapshot_%d.ckpt", pid); - } - else { - *name_str = strdup(orte_snapc_base_global_snapshot_ref); - } - - return ORTE_SUCCESS; -} - -int orte_snapc_base_get_global_snapshot_metadata_file(char **file_name, char *uniq_snapshot_name) -{ - asprintf(file_name, "%s/%s/%s", - orte_snapc_base_global_snapshot_dir, - uniq_snapshot_name, - orte_snapc_base_metadata_filename); - - return ORTE_SUCCESS; -} - -int orte_snapc_base_get_global_snapshot_directory(char **dir_name, char *uniq_snapshot_name) -{ - asprintf(dir_name, "%s/%s/%d", - orte_snapc_base_global_snapshot_dir, - uniq_snapshot_name, - (int)orte_snapc_base_snapshot_seq_number); - - return ORTE_SUCCESS; -} - -int orte_snapc_base_init_global_snapshot_directory(char *uniq_global_snapshot_name, bool empty_metadata) -{ - char * dir_name = NULL, *meta_data_fname = NULL; - mode_t my_mode = S_IRWXU; - int ret; - int exit_status = ORTE_SUCCESS; - FILE * meta_data = NULL; - - /* - * Make the snapshot directory from the uniq_global_snapshot_name - */ - orte_snapc_base_get_global_snapshot_directory(&dir_name, uniq_global_snapshot_name); - if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(dir_name, my_mode)) ) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - /* - * Initialize the metadata file at the top of that directory. 
- */ - orte_snapc_base_get_global_snapshot_metadata_file(&meta_data_fname, uniq_global_snapshot_name); - - if (NULL == (meta_data = fopen(meta_data_fname, "a")) ) { - opal_output(orte_snapc_base_output, - "%s) base:init_global_snapshot_directory: Error: Unable to open the file (%s)\n", - ORTE_SNAPC_COORD_NAME_PRINT(orte_snapc_coord_type), - meta_data_fname); - ORTE_ERROR_LOG(ORTE_ERROR); - exit_status = ORTE_ERROR; - goto cleanup; - } - - /* - * Put in the checkpoint sequence number - */ - if( empty_metadata ) { - fprintf(meta_data, "#\n"); - } - else { - /* - * Put in the checkpoint sequence number - */ - fprintf(meta_data, "#\n%s%d\n", SNAPC_METADATA_SEQ, (int)orte_snapc_base_snapshot_seq_number); - - fclose(meta_data); - meta_data = NULL; - - /* Add timestamp */ - orte_snapc_base_add_timestamp(uniq_global_snapshot_name); - } - - cleanup: - if(NULL != meta_data) - fclose(meta_data); - if(NULL != dir_name) - free(dir_name); - if(NULL != meta_data_fname) - free(meta_data_fname); - - return ORTE_SUCCESS; -} - -/* - * Metadata file handling functions - * File is of the form: - * - * # - * # Checkpoint Sequence # - * # Begin Timestamp - * # Process ID - * # OPAL CRS - * opal_restart ----mca crs_base_snapshot_dir SNAPSHOT_LOC SNAPSHOT_REF - * ... 
- * # End Timestamp - * - * E.g., - # - # Seq: 0 - # Timestamp: Mon Jun 5 18:32:08 2006 - # Process: 0.1.0 - # OPAL CRS Component: blcr - opal_restart --mca crs_base_snapshot_dir /tmp/ompi_global_snapshot_32535.ckpt/0 opal_snapshot_0.ckpt - # Process: 0.1.1 - # OPAL CRS Component: blcr - opal_restart --mca crs_base_snapshot_dir /tmp/ompi_global_snapshot_32535.ckpt/0 opal_snapshot_1.ckpt - # Timestamp: Mon Jun 5 18:32:10 2006 - # - # Seq: 1 - # Timestamp: Mon Jun 5 18:32:12 2006 - # Process: 0.1.0 - # OPAL CRS Component: blcr - opal_restart --mca crs_base_snapshot_dir /tmp/ompi_global_snapshot_32535.ckpt/1 opal_snapshot_0.ckpt - # Process: 0.1.1 - # OPAL CRS Component: blcr - opal_restart --mca crs_base_snapshot_dir /tmp/ompi_global_snapshot_32535.ckpt/1 opal_snapshot_1.ckpt - # Timestamp: Mon Jun 5 18:32:13 2006 - * - */ -int orte_snapc_base_add_timestamp(char * global_snapshot_ref) -{ - int exit_status = ORTE_SUCCESS; - FILE * meta_data = NULL; - char * meta_data_fname = NULL; - time_t timestamp; - - orte_snapc_base_get_global_snapshot_metadata_file(&meta_data_fname, global_snapshot_ref); - - if (NULL == (meta_data = fopen(meta_data_fname, "a")) ) { - opal_output(orte_snapc_base_output, - "%s) base:add_timestamp: Error: Unable to open the file (%s)\n", - ORTE_SNAPC_COORD_NAME_PRINT(orte_snapc_coord_type), - meta_data_fname); - ORTE_ERROR_LOG(ORTE_ERROR); - exit_status = ORTE_ERROR; - goto cleanup; - } - - timestamp = time(NULL); - fprintf(meta_data, "%s%s", SNAPC_METADATA_TIME, ctime(×tamp)); - - cleanup: - if( NULL != meta_data ) - fclose(meta_data); - if( NULL != meta_data_fname) - free(meta_data_fname); - - return exit_status; -} - -int orte_snapc_base_finalize_metadata(char * global_snapshot_ref) -{ - int exit_status = ORTE_SUCCESS; - FILE * meta_data = NULL; - char * meta_data_fname = NULL; - - /* Add the final timestamp */ - orte_snapc_base_add_timestamp(global_snapshot_ref); - - orte_snapc_base_get_global_snapshot_metadata_file(&meta_data_fname, 
global_snapshot_ref); - - if (NULL == (meta_data = fopen(meta_data_fname, "a")) ) { - opal_output(orte_snapc_base_output, - "%s) base:add_timestamp: Error: Unable to open the file (%s)\n", - ORTE_SNAPC_COORD_NAME_PRINT(orte_snapc_coord_type), - meta_data_fname); - ORTE_ERROR_LOG(ORTE_ERROR); - exit_status = ORTE_ERROR; - goto cleanup; - } - - fprintf(meta_data, "%s%d\n", SNAPC_METADATA_DONE_SEQ, (int)orte_snapc_base_snapshot_seq_number); - - cleanup: - if( NULL != meta_data ) - fclose(meta_data); - if( NULL != meta_data_fname) - free(meta_data_fname); - - return exit_status; -} - - -int orte_snapc_base_add_vpid_metadata( orte_process_name_t *proc, - char * global_snapshot_ref, - char *snapshot_ref, - char *snapshot_location, - char *crs_agent) -{ - int ret, exit_status = ORTE_SUCCESS; - FILE * meta_data = NULL; - char * meta_data_fname = NULL; - char * crs_comp = NULL; - char * proc_name = NULL; - char * local_snapshot = NULL; - int prev_pid = 0; - - if( NULL == snapshot_location ) { - return ORTE_ERROR; - } - - orte_snapc_base_get_global_snapshot_metadata_file(&meta_data_fname, global_snapshot_ref); - - if (NULL == (meta_data = fopen(meta_data_fname, "a")) ) { - opal_output(orte_snapc_base_output, - "%s) base:add_metadata: Error: Unable to open the file (%s)\n", - ORTE_SNAPC_COORD_NAME_PRINT(orte_snapc_coord_type), - meta_data_fname); - ORTE_ERROR_LOG(ORTE_ERROR); - exit_status = ORTE_ERROR; - goto cleanup; - } - - /* - * Something of the form: - * 0.1.0 opal_snapshot_0.ckpt /tmp/ompi_global_snapshot_8827.ckpt/1/opal_snapshot_0.ckpt BLCR - * or better yet start to create the proper app schema: - * orte_restart --mca crs_base_snapshot_dir /tmp/ompi_global_snapshot_8827.ckpt/1 opal_snapshot_0.ckpt - */ - orte_util_convert_process_name_to_string(&proc_name, proc); - - /* Extract the checkpointer */ - if( NULL == crs_agent ) { - asprintf(&local_snapshot, "%s/%s", snapshot_location, snapshot_ref); - if( OPAL_SUCCESS != (ret = 
opal_crs_base_extract_expected_component(local_snapshot, &crs_comp, &prev_pid)) ) { - opal_show_help("help-orte-snapc-base.txt", "invalid_metadata", true, - proc_name, opal_crs_base_metadata_filename, local_snapshot); - exit_status = ret; - goto cleanup; - } - } else { - crs_comp = strdup(crs_agent); - } - - /* Write the string */ - fprintf(meta_data, "%s%s\n", SNAPC_METADATA_PROCESS, proc_name); - fprintf(meta_data, "%s%s\n", SNAPC_METADATA_CRS_COMP, crs_comp); - fprintf(meta_data, "%s%s\n", SNAPC_METADATA_SNAP_REF, snapshot_ref); - fprintf(meta_data, "%s%s\n", SNAPC_METADATA_SNAP_LOC, snapshot_location); - - cleanup: - if( NULL != meta_data ) { - fclose(meta_data); - meta_data = NULL; - } - if( NULL != meta_data_fname) { - free(meta_data_fname); - meta_data_fname = NULL; - } - if( NULL != local_snapshot ) { - free( local_snapshot ); - local_snapshot = NULL; - } - - return exit_status; -} - -int orte_snapc_base_extract_metadata(orte_snapc_base_global_snapshot_t *global_snapshot) -{ - int exit_status = ORTE_SUCCESS; - FILE * meta_data = NULL; - char * meta_data_fname = NULL; - int next_seq_int; - char * token = NULL; - char * value = NULL; - orte_snapc_base_local_snapshot_t *vpid_snapshot = NULL; - - /* - * Open the metadata file - */ - orte_snapc_base_get_global_snapshot_metadata_file(&meta_data_fname, global_snapshot->reference_name); - if (NULL == (meta_data = fopen(meta_data_fname, "r")) ) { - ORTE_ERROR_LOG(ORTE_ERROR); - exit_status = ORTE_ERROR; - goto cleanup; - } - - /* - * If we were not given a sequence number, first find the largest valid seq number - */ - if(0 > global_snapshot->seq_num ) { - while(0 <= (next_seq_int = get_next_valid_seq_number(meta_data)) ){ - global_snapshot->seq_num = next_seq_int; - } - rewind(meta_data); - } - - /* - * Find the requested sequence number, - */ - while( global_snapshot->seq_num != (next_seq_int = get_next_seq_number(meta_data)) ) { - /* We didn't find the requested seq */ - if(0 > next_seq_int) { - exit_status = 
ORTE_ERROR; - goto cleanup; - } - } - - /* - * Extract each token and make the records - */ - do { - if( ORTE_SUCCESS != metadata_extract_next_token(meta_data, &token, &value) ) { - break; - } - - if(0 == strncmp(SNAPC_METADATA_SEQ, token, strlen(SNAPC_METADATA_SEQ)) ) { - break; - } - else if(0 == strncmp(SNAPC_METADATA_TIME, token, strlen(SNAPC_METADATA_TIME)) ) { - if( NULL == global_snapshot->start_time) { - global_snapshot->start_time = strdup(value); - } - else { - global_snapshot->end_time = strdup(value); - } - } - else if(0 == strncmp(SNAPC_METADATA_PROCESS, token, strlen(SNAPC_METADATA_PROCESS)) ) { - orte_process_name_t proc; - - orte_util_convert_string_to_process_name(&proc, value); - - /* Not the first process, so append it to the list */ - if( NULL != vpid_snapshot) { - opal_list_append(&global_snapshot->local_snapshots, &(vpid_snapshot->super)); - } - - vpid_snapshot = OBJ_NEW(orte_snapc_base_local_snapshot_t); - - vpid_snapshot->process_name.jobid = proc.jobid; - vpid_snapshot->process_name.vpid = proc.vpid; - } - else if(0 == strncmp(SNAPC_METADATA_CRS_COMP, token, strlen(SNAPC_METADATA_CRS_COMP)) ) { - vpid_snapshot->opal_crs = strdup(value); - } - else if(0 == strncmp(SNAPC_METADATA_SNAP_REF, token, strlen(SNAPC_METADATA_SNAP_REF)) ) { - vpid_snapshot->reference_name = strdup(value); - } - else if(0 == strncmp(SNAPC_METADATA_SNAP_LOC, token, strlen(SNAPC_METADATA_SNAP_LOC)) ) { - vpid_snapshot->local_location = strdup(value); - vpid_snapshot->remote_location = strdup(value); - } - } while(0 == feof(meta_data) ); - - /* Append the last item */ - if( NULL != vpid_snapshot) { - opal_list_append(&global_snapshot->local_snapshots, &(vpid_snapshot->super)); - } - - cleanup: - if(NULL != meta_data) - fclose(meta_data); - if(NULL != meta_data_fname) - free(meta_data_fname); - - return exit_status; -} - -/* - * Extract the next sequence number from the file - */ -static int get_next_seq_number(FILE *file) -{ - char *token = NULL; - char *value = NULL; - 
int seq_int = -1; - - do { - if( ORTE_SUCCESS != metadata_extract_next_token(file, &token, &value) ) { - seq_int = -1; - goto cleanup; - } - } while(0 != strncmp(token, SNAPC_METADATA_SEQ, strlen(SNAPC_METADATA_SEQ)) ); - - seq_int = atoi(value); - - cleanup: - if( NULL != token) - free(token); - if( NULL != value) - free(value); - - return seq_int; -} - -/* - * Extract the next Valid sequence number from the file - */ -static int get_next_valid_seq_number(FILE *file) -{ - char *token = NULL; - char *value = NULL; - int seq_int = -1; - - do { - if( ORTE_SUCCESS != metadata_extract_next_token(file, &token, &value) ) { - seq_int = -1; - goto cleanup; - } - } while(0 != strncmp(token, SNAPC_METADATA_DONE_SEQ, strlen(SNAPC_METADATA_DONE_SEQ)) ); - - seq_int = atoi(value); - - cleanup: - if( NULL != token) - free(token); - if( NULL != value) - free(value); - - return seq_int; -} - -static int metadata_extract_next_token(FILE *file, char **token, char **value) -{ - int exit_status = ORTE_SUCCESS; - int max_len = 256; - char * line = NULL; - int line_len = 0; - int c = 0, s = 0, v = 0; - char *local_token = NULL; - char *local_value = NULL; - bool end_of_line = false; - - line = (char *) malloc(sizeof(char) * max_len); - - try_again: - /* - * If we are at the end of the file, then just return - */ - if(0 != feof(file) ) { - exit_status = ORTE_ERROR; - goto cleanup; - } - - /* - * Other wise grab the next token/value pair - */ - if (NULL == fgets(line, max_len, file) ) { - exit_status = ORTE_ERROR; - goto cleanup; - } - line_len = strlen(line); - /* Strip off the new line if it it there */ - if('\n' == line[line_len-1]) { - line[line_len-1] = '\0'; - line_len--; - end_of_line = true; - } - else { - end_of_line = false; - } - - /* Ignore lines with just '#' too */ - if(2 >= line_len) - goto try_again; - - /* - * Extract the token from the set - */ - for(c = 0; - line[c] != ':' && - c < line_len; - ++c) { - ; - } - c += 2; /* For the ' ' and the '\0' */ - local_token = (char 
*)malloc(sizeof(char) * (c + 1)); - - for(s = 0; s < c; ++s) { - local_token[s] = line[s]; - } - - local_token[s] = '\0'; - *token = strdup(local_token); - - if( NULL != local_token) { - free(local_token); - local_token = NULL; - } - - /* - * Extract the value from the set - */ - local_value = (char *)malloc(sizeof(char) * (line_len - c + 1)); - for(v = 0, s = c; - s < line_len; - ++s, ++v) { - local_value[v] = line[s]; - } - - while(!end_of_line) { - if (NULL == fgets(line, max_len, file) ) { - exit_status = ORTE_ERROR; - goto cleanup; - } - line_len = strlen(line); - /* Strip off the new line if it it there */ - if('\n' == line[line_len-1]) { - line[line_len-1] = '\0'; - line_len--; - end_of_line = true; - } - else { - end_of_line = false; - } - - local_value = (char *)realloc(local_value, sizeof(char) * line_len); - for(s = 0; - s < line_len; - ++s, ++v) { - local_value[v] = line[s]; - } - } - - local_value[v] = '\0'; - *value = strdup(local_value); - - cleanup: - if( NULL != local_token) - free(local_token); - if( NULL != local_value) - free(local_value); - if( NULL != line) - free(line); - - return exit_status; -} - int orte_snapc_ckpt_state_str(char ** state_str, int state) { switch(state) { @@ -1305,11 +750,14 @@ int orte_snapc_ckpt_state_str(char ** state_str, int state) case ORTE_SNAPC_CKPT_STATE_STOPPED: *state_str = strdup("Stopped"); break; - case ORTE_SNAPC_CKPT_STATE_FILE_XFER: - *state_str = strdup("File Transfer"); + case ORTE_SNAPC_CKPT_STATE_MIGRATING: + *state_str = strdup("Migrating"); break; - case ORTE_SNAPC_CKPT_STATE_FINISHED: - *state_str = strdup("Finished"); + case ORTE_SNAPC_CKPT_STATE_ESTABLISHED: + *state_str = strdup("Checkpoint Established"); + break; + case ORTE_SNAPC_CKPT_STATE_RECOVERED: + *state_str = strdup("Continuing/Recovered"); break; case ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL: *state_str = strdup("Locally Finished"); diff --git a/orte/mca/snapc/base/snapc_base_open.c b/orte/mca/snapc/base/snapc_base_open.c index 
f9cb3dcaf1..4cc40e5204 100644 --- a/orte/mca/snapc/base/snapc_base_open.c +++ b/orte/mca/snapc/base/snapc_base_open.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2008 The Trustees of the University of Tennessee. * All rights reserved. @@ -35,6 +35,9 @@ #include "opal/util/output.h" #include "orte/constants.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" @@ -68,13 +71,8 @@ opal_list_t orte_snapc_base_components_available; orte_snapc_base_component_t orte_snapc_base_selected_component; orte_snapc_coord_type_t orte_snapc_coord_type = ORTE_SNAPC_UNASSIGN_TYPE; -char * orte_snapc_base_global_snapshot_dir = NULL; -char * orte_snapc_base_global_snapshot_loc = NULL; -char * orte_snapc_base_global_snapshot_ref = NULL; -bool orte_snapc_base_store_in_place = true; bool orte_snapc_base_store_only_one_seq = false; -bool orte_snapc_base_establish_global_snapshot_dir = false; -bool orte_snapc_base_is_global_dir_shared = false; +bool orte_snapc_base_has_recovered = false; /** * Function for finding and opening either all MCA components, @@ -90,48 +88,6 @@ int orte_snapc_base_open(void) orte_snapc_base_output = opal_output_open(NULL); - /* Global Snapshot directory */ - mca_base_param_reg_string_name("snapc", - "base_global_snapshot_dir", - "The base directory to use when storing global snapshots", - false, false, - opal_home_directory(), - &orte_snapc_base_global_snapshot_dir); - - mca_base_param_reg_int_name("snapc", - "base_global_shared", - "If the global_snapshot_dir is on a shared file system all nodes can access, " - "then the checkpoint files can be copied more efficiently when FileM is used." 
- " [Default = disabled]", - false, false, - 0, - &value); - orte_snapc_base_is_global_dir_shared = OPAL_INT_TO_BOOL(value); - - OPAL_OUTPUT_VERBOSE((20, orte_snapc_base_output, - "snapc:base: open: base_global_snapshot_dir = %s (%s)", - orte_snapc_base_global_snapshot_dir, - (orte_snapc_base_is_global_dir_shared ? "Shared" : "Local") )); - - /* - * Store the checkpoint files in their final location. - * This assumes that the storage place is on a shared file - * system that all nodes can access uniformly. - * Default = enabled - */ - mca_base_param_reg_int_name("snapc", - "base_store_in_place", - "If global_snapshot_dir is on a shared file system all nodes can access, " - "then the checkpoint files can be stored in place instead of incurring a " - "remote copy. [Default = enabled]", - false, false, - 1, - &value); - orte_snapc_base_store_in_place = OPAL_INT_TO_BOOL(value); - OPAL_OUTPUT_VERBOSE((20, orte_snapc_base_output, - "snapc:base: open: base_store_in_place = %d", - orte_snapc_base_store_in_place)); - /* * Reuse sequence numbers * This will create a directory and always use seq 0 for all checkpoints @@ -149,50 +105,10 @@ int orte_snapc_base_open(void) "snapc:base: open: base_only_one_seq = %d", orte_snapc_base_store_only_one_seq)); - /* - * Pre-establish the global snapshot directory upon job registration - */ - mca_base_param_reg_int_name("snapc_base", - "establish_global_snapshot_dir", - "Establish the global snapshot directory on job startup. [Default = disabled]", - false, false, - 0, - &value); - orte_snapc_base_establish_global_snapshot_dir = OPAL_INT_TO_BOOL(value); - - OPAL_OUTPUT_VERBOSE((20, orte_snapc_base_output, - "snapc:base: open: base_establish_global_snapshot_dir = %d", - orte_snapc_base_establish_global_snapshot_dir)); - - /* - * User defined global snapshot directory name for this job - */ - mca_base_param_reg_string_name("snapc_base", - "global_snapshot_ref", - "The global snapshot reference to be used for this job. 
" - " [Default = ompi_global_snapshot_MPIRUNPID.ckpt]", - false, false, - NULL, - &orte_snapc_base_global_snapshot_ref); - - OPAL_OUTPUT_VERBOSE((20, orte_snapc_base_output, - "snapc:base: open: base_global_snapshot_ref = %s", - orte_snapc_base_global_snapshot_ref)); /* Init the sequence (interval) number */ orte_snapc_base_snapshot_seq_number = 0; - if( NULL == orte_snapc_base_global_snapshot_loc ) { - char *t1 = NULL; - char *t2 = NULL; - orte_snapc_base_unique_global_snapshot_name(&t1, getpid() ); - orte_snapc_base_get_global_snapshot_directory(&t2, t1 ); - orte_snapc_base_global_snapshot_loc = strdup(t2); - free(t1); - free(t2); - } - - /* * Which SnapC component to open * - NULL or "" = auto-select @@ -218,7 +134,12 @@ int orte_snapc_base_open(void) true)) { return ORTE_ERROR; } - + + /* + * Open up the SStore framework + */ + orte_sstore_base_open(); + return ORTE_SUCCESS; } diff --git a/orte/mca/snapc/base/snapc_base_select.c b/orte/mca/snapc/base/snapc_base_select.c index 4166251972..185b5f6814 100644 --- a/orte/mca/snapc/base/snapc_base_select.c +++ b/orte/mca/snapc/base/snapc_base_select.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. 
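This patch nests the new SStore framework inside SnapC's own lifecycle. orte_snapc_base_open() calls orte_sstore_base_open() once its own components are opened (snapc_base_open.c, just above), orte_snapc_base_select() calls orte_sstore_base_select() near the end of selection (the snapc_base_select.c hunk that follows), and orte_snapc_base_close() calls orte_sstore_base_close() before finalizing the selected SnapC component (snapc_base_close.c). The C sketch below models only that call ordering. The parent_*/child_* hooks, the call log and log_index() are hypothetical stand-ins, not ORTE functions.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins, not ORTE functions: each hook appends its name
 * to a fixed-size log so that the call ordering can be inspected. */
#define LOG_MAX 8
static const char *call_log[LOG_MAX];
static int n_calls = 0;

static void record(const char *name)
{
    assert(n_calls < LOG_MAX);     /* guard the fixed-size log */
    call_log[n_calls++] = name;
}

/* Position of a name in the log, or -1 if it was never recorded. */
static int log_index(const char *name)
{
    for (int i = 0; i < n_calls; ++i) {
        if (0 == strcmp(call_log[i], name)) {
            return i;
        }
    }
    return -1;
}

/* Child framework hooks (playing the role of SStore). */
static void child_open(void)   { record("child_open"); }
static void child_select(void) { record("child_select"); }
static void child_close(void)  { record("child_close"); }

/* Parent framework hooks (playing the role of SnapC); each one drives the
 * matching child hook at the point where the patch places the call. */
static void parent_open(void)
{
    record("parent_open_components");
    child_open();                  /* after the parent's components open */
}

static void parent_select(void)
{
    record("parent_select_component");
    child_select();                /* after the parent picks its component */
}

static void parent_close(void)
{
    child_close();                 /* before the parent finalizes its component */
    record("parent_finalize");
}

/* Run one open/select/close cycle and return how many hooks ran. */
static int run_lifecycle(void)
{
    n_calls = 0;
    parent_open();
    parent_select();
    parent_close();
    return n_calls;
}
```

For example, run_lifecycle() returns 6, and log_index() then confirms that child_close is recorded before parent_finalize, matching the close ordering in the patch.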
@@ -27,6 +27,8 @@ #include "opal/mca/base/base.h" #include "opal/mca/base/mca_base_param.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" @@ -127,6 +129,11 @@ int orte_snapc_base_select(bool seed, bool app) } } + /* + * Select on the SStore framework + */ + orte_sstore_base_select(); + cleanup: if( NULL != include_list ) { free(include_list); diff --git a/orte/mca/snapc/full/help-orte-snapc-full.txt b/orte/mca/snapc/full/help-orte-snapc-full.txt index 751623fab8..1f58ac9c6d 100644 --- a/orte/mca/snapc/full/help-orte-snapc-full.txt +++ b/orte/mca/snapc/full/help-orte-snapc-full.txt @@ -1,6 +1,6 @@ -*- text -*- # -# Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana +# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana # University Research and Technology # Corporation. All rights reserved. # Copyright (c) 2004-2005 The University of Tennessee and The University @@ -23,3 +23,9 @@ Warning: waitpid(%d) failed with ret = %d while waiting on process %s. Typically this means that you are stopping a restarted process. We skip the rest of the checks, since this is normally not a problem. + +[amca_param_not_found] +Warning: Unable to determine the AMCA parameter from the environment. + This is the option supplied to mpirun as '-am '. + Restart may not be able to determine the correct AMCA/MCA + parameters to use when restarting. diff --git a/orte/mca/snapc/full/snapc_full.h b/orte/mca/snapc/full/snapc_full.h index 37965b16a6..f4b2346793 100644 --- a/orte/mca/snapc/full/snapc_full.h +++ b/orte/mca/snapc/full/snapc_full.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved.
@@ -31,7 +31,7 @@ #include "opal/event/event.h" #include "opal/util/opal_sos.h" -#include "orte/mca/filem/filem.h" +#include "orte/mca/sstore/sstore.h" #include "orte/mca/snapc/snapc.h" BEGIN_C_DECLS @@ -41,15 +41,15 @@ BEGIN_C_DECLS */ typedef uint8_t orte_snapc_full_cmd_flag_t; #define ORTE_SNAPC_FULL_CMD OPAL_UINT8 -#define ORTE_SNAPC_FULL_UPDATE_JOB_STATE_CMD 1 -#define ORTE_SNAPC_FULL_UPDATE_JOB_STATE_QUICK_CMD 2 -#define ORTE_SNAPC_FULL_UPDATE_ORTED_STATE_CMD 3 -#define ORTE_SNAPC_FULL_UPDATE_ORTED_STATE_QUICK_CMD 4 -#define ORTE_SNAPC_FULL_VPID_ASSOC_CMD 5 -#define ORTE_SNAPC_FULL_ESTABLISH_DIR_CMD 6 -#define ORTE_SNAPC_FULL_START_CKPT_CMD 7 -#define ORTE_SNAPC_FULL_END_CKPT_CMD 8 -#define ORTE_SNAPC_FULL_MAX 9 +#define ORTE_SNAPC_FULL_UPDATE_JOB_STATE_CMD 1 +#define ORTE_SNAPC_FULL_UPDATE_JOB_STATE_QUICK_CMD 2 +#define ORTE_SNAPC_FULL_UPDATE_ORTED_STATE_CMD 3 +#define ORTE_SNAPC_FULL_UPDATE_ORTED_STATE_QUICK_CMD 4 +#define ORTE_SNAPC_FULL_VPID_ASSOC_CMD 5 +#define ORTE_SNAPC_FULL_ESTABLISH_DIR_CMD 6 +#define ORTE_SNAPC_FULL_RESTART_PROC_INFO 7 +#define ORTE_SNAPC_FULL_REQUEST_OP_CMD 8 +#define ORTE_SNAPC_FULL_MAX 9 /* * Local Component structures @@ -72,15 +72,6 @@ typedef uint8_t orte_snapc_full_cmd_flag_t; /** State of the checkpoint */ int state; - - /** OPAL CRS Component */ - char * opal_crs; - - /** Checkpoint Options */ - opal_crs_base_ckpt_options_t *options; - - /** FileM request */ - orte_filem_base_request_t *filem_request; }; typedef struct orte_snapc_full_orted_snapshot_t orte_snapc_full_orted_snapshot_t; OBJ_CLASS_DECLARATION(orte_snapc_full_orted_snapshot_t); @@ -97,6 +88,7 @@ typedef uint8_t orte_snapc_full_cmd_flag_t; char * comm_pipe_w; int comm_pipe_r_fd; int comm_pipe_w_fd; + int unique_pipe_id; /* An opal event handle for the read pipe */ struct opal_event comm_pipe_r_eh; @@ -105,15 +97,18 @@ typedef uint8_t orte_snapc_full_cmd_flag_t; /** Process pid */ pid_t process_pid; - /** Options */ - opal_crs_base_ckpt_options_t *options; + /** 
Is this process a migration target */ + bool migrating; + + /** Finished flag */ + bool finished; }; typedef struct orte_snapc_full_app_snapshot_t orte_snapc_full_app_snapshot_t; OBJ_CLASS_DECLARATION(orte_snapc_full_app_snapshot_t); - extern bool orte_snapc_full_skip_filem; extern bool orte_snapc_full_skip_app; extern bool orte_snapc_full_timing_enabled; + extern int orte_snapc_full_progress_meter; extern int orte_snapc_full_max_wait_time; int orte_snapc_full_component_query(mca_base_module_t **module, int *priority); @@ -131,6 +126,7 @@ typedef uint8_t orte_snapc_full_cmd_flag_t; int orte_snapc_full_start_ckpt(orte_snapc_base_quiesce_t *datum); int orte_snapc_full_end_ckpt(orte_snapc_base_quiesce_t *datum); + int orte_snapc_full_request_op(orte_snapc_base_request_op_t *datum); /* * Global Coordinator Functionality @@ -146,6 +142,8 @@ typedef uint8_t orte_snapc_full_cmd_flag_t; char **agent_ckpt); int global_coord_start_ckpt(orte_snapc_base_quiesce_t *datum); int global_coord_end_ckpt(orte_snapc_base_quiesce_t *datum); + int global_coord_restart_proc_info(pid_t local_pid, + char * local_hostname); /* * Local Coordinator Functionality @@ -156,8 +154,7 @@ typedef uint8_t orte_snapc_full_cmd_flag_t; int local_coord_release_job(orte_jobid_t jobid); int local_coord_job_state_update(orte_jobid_t jobid, int job_ckpt_state, - char **job_ckpt_ref, - char **job_ckpt_loc, + orte_sstore_base_handle_t ss_handle, opal_crs_base_ckpt_options_t *options); /* @@ -166,8 +163,7 @@ typedef uint8_t orte_snapc_full_cmd_flag_t; int app_coord_init(void); int app_coord_finalize(void); int app_coord_ft_event(int state); - int app_coord_start_ckpt(orte_snapc_base_quiesce_t *datum); - int app_coord_end_ckpt(orte_snapc_base_quiesce_t *datum); + int app_coord_request_op(orte_snapc_base_request_op_t *datum); END_C_DECLS diff --git a/orte/mca/snapc/full/snapc_full_app.c b/orte/mca/snapc/full/snapc_full_app.c index e3c0ef56e0..ed23c0653f 100644 --- a/orte/mca/snapc/full/snapc_full_app.c +++ 
b/orte/mca/snapc/full/snapc_full_app.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. @@ -53,8 +53,11 @@ #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" #include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/grpcomm/grpcomm.h" #include "orte/mca/rml/rml.h" #include "orte/mca/rml/rml_types.h" +#include "orte/mca/routed/routed.h" +#include "orte/mca/routed/base/base.h" #include "snapc_full.h" @@ -63,16 +66,18 @@ ************************************/ static void snapc_full_app_signal_handler (int signo); static int snapc_full_app_notify_response(opal_cr_ckpt_cmd_state_t resp); -static int app_notify_resp_stage_1(opal_cr_ckpt_cmd_state_t resp, - opal_crs_base_ckpt_options_t *options); +static int app_notify_resp_stage_1(opal_cr_ckpt_cmd_state_t resp); static int app_notify_resp_stage_2(int cr_state ); -static int app_notify_resp_stage_3(int cr_state); +static int app_notify_resp_stage_3(int cr_state, bool skip_fin_msg); +static int app_define_pipe_names(void); static int snapc_full_app_notify_reopen_files(void); -static int snapc_full_app_ckpt_handshake_start(opal_crs_base_ckpt_options_t *options, - opal_cr_ckpt_cmd_state_t resp); +static int snapc_full_app_ckpt_handshake_start(opal_cr_ckpt_cmd_state_t resp); static int snapc_full_app_ckpt_handshake_end(int cr_state); static int snapc_full_app_ft_event_update_process_info(orte_process_name_t proc, pid_t pid); +static int snapc_full_app_finished_msg(int cr_state); + +static int app_notify_resp_inc_prep_only(int cr_state); static char *app_comm_pipe_r = NULL; static char *app_comm_pipe_w = NULL; @@ -81,21 +86,30 @@ static int app_comm_pipe_w_fd = -1; static opal_crs_base_snapshot_t *local_snapshot = NULL; -static int app_cur_epoch = -1; -static int app_last_epoch = -1; -static bool 
app_split_ckpt = false; static bool app_notif_processed = false; -static char * app_cur_global_ref = NULL; +static bool currently_migrating = false; +static bool currently_all_migrating = false; + +static bool currently_checkpointing = false; +static int current_unique_id = 0; + +static int current_cr_state = OPAL_CRS_NONE; + +static orte_sstore_base_handle_t current_ss_handle = ORTE_SSTORE_HANDLE_INVALID, last_ss_handle = ORTE_SSTORE_HANDLE_INVALID; +static opal_crs_base_ckpt_options_t *current_options = NULL; /************************ * Function Definitions ************************/ -int app_coord_init() { - int exit_status = ORTE_SUCCESS; +int app_coord_init() +{ + int ret, exit_status = ORTE_SUCCESS; opal_cr_notify_callback_fn_t prev_notify_func; - char *tmp_pid = NULL; + orte_snapc_full_cmd_flag_t command = ORTE_SNAPC_FULL_REQUEST_OP_CMD; + orte_snapc_base_request_op_event_t op_event = ORTE_SNAPC_OP_INIT; + opal_buffer_t buffer; OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, "App) Initalized for Application %s\n", @@ -106,11 +120,11 @@ int app_coord_init() { */ opal_cr_reg_notify_callback(snapc_full_app_notify_response, &prev_notify_func); - /* String representation of the PID */ - asprintf(&tmp_pid, "%d", getpid()); - - asprintf(&app_comm_pipe_r, "%s/%s.%s", opal_cr_pipe_dir, OPAL_CR_NAMED_PROG_R, tmp_pid); - asprintf(&app_comm_pipe_w, "%s/%s.%s", opal_cr_pipe_dir, OPAL_CR_NAMED_PROG_W, tmp_pid); + /* + * Set the pipe names + */ + current_unique_id = 0; + app_define_pipe_names(); /* * Setup a signal handler to catch and start the proper thread @@ -129,16 +143,172 @@ int app_coord_init() { "app) Named Pipes (%s) (%s), Signal (%d)", app_comm_pipe_r, app_comm_pipe_w, opal_cr_entry_point_signal)); - cleanup: - if( NULL != tmp_pid) { - free(tmp_pid); - tmp_pid = NULL; + /* + * All processes must sync here, so the Global coordinator can know that + * it is safe to checkpoint now. 
+ * Rank 0: Sends confirmation message to the Global Coordinator + */ + if( 0 == ORTE_PROC_MY_NAME->vpid ) { + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "app) Startup Barrier...")); } + if( ORTE_SUCCESS != (ret = orte_grpcomm.barrier()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( 0 == ORTE_PROC_MY_NAME->vpid ) { + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "app) Startup Barrier: Send INIT to HNP...!")); + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SNAPC_FULL_CMD))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + return ORTE_ERROR; + } + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(ORTE_PROC_MY_NAME->jobid), 1, ORTE_JOBID))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + return ORTE_ERROR; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(op_event), 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + return ORTE_ERROR; + } + + OBJ_DESTRUCT(&buffer); + } + + if( 0 == ORTE_PROC_MY_NAME->vpid ) { + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "app) Startup Barrier: Done!")); + } + + cleanup: return exit_status; } -int app_coord_finalize() { +int app_coord_finalize() +{ + int ret, exit_status = ORTE_SUCCESS; + orte_snapc_full_cmd_flag_t command = ORTE_SNAPC_FULL_REQUEST_OP_CMD; + orte_snapc_base_request_op_event_t op_event = ORTE_SNAPC_OP_FIN; + opal_buffer_t buffer; + orte_std_cntr_t count; + + /* + * All processes must sync here, so the Global coordinator can know that + * it is no longer safe to checkpoint.
+ * Rank 0: Sends confirmation message to the Global Coordinator + */ + if( 0 == ORTE_PROC_MY_NAME->vpid ) { + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "app) Shutdown Barrier...")); + } + + if( ORTE_SUCCESS != (ret = orte_grpcomm.barrier()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( 0 == ORTE_PROC_MY_NAME->vpid ) { + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "app) Shutdown Barrier: Send FIN to HNP...!")); + + /* Tell HNP that we are finalizing */ + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SNAPC_FULL_CMD))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(ORTE_PROC_MY_NAME->jobid), 1, ORTE_JOBID))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(op_event), 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + + OBJ_DESTRUCT(&buffer); + + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "app) Shutdown Barrier: Waiting on FIN_ACK...!")); + + /* Wait for HNP to tell us that it is ok to finish finalization. + * We could have been checkpointing just as we entered finalize, so we + * need to wait until the checkpoint is finished before finishing. 
+ */ + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + if (0 > (ret = orte_rml.recv_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &command, &count, ORTE_SNAPC_FULL_CMD))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &op_event, &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "app) Shutdown Barrier: Waiting on barrier...!")); + } + + if( ORTE_SUCCESS != (ret = orte_grpcomm.barrier()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( 0 == ORTE_PROC_MY_NAME->vpid ) { + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "app) Shutdown Barrier, Done!")); + } + + cleanup: /* * Cleanup named pipes */ @@ -167,13 +337,19 @@ static void snapc_full_app_signal_handler (int signo) /* Not our signal */ return; } - /* - * Signal thread to start checkpoint handshake - */ - opal_cr_checkpoint_request = OPAL_CR_STATUS_REQUESTED; + if( currently_checkpointing ) { + opal_output(0, "snapc:full:(app) Error: Received a signal to checkpoint, but already checkpointing.
Ignoring request!"); + } + else { + currently_checkpointing = true; + /* + * Signal thread to start checkpoint handshake + */ + opal_cr_checkpoint_request = OPAL_CR_STATUS_REQUESTED; - OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "App) signal_handler: Receive Checkpoint Request.")); + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "App) signal_handler: Receive Checkpoint Request.")); + } } /* @@ -181,35 +357,48 @@ static void snapc_full_app_signal_handler (int signo) */ int snapc_full_app_notify_response(opal_cr_ckpt_cmd_state_t resp) { - static opal_crs_base_ckpt_options_t *options = NULL; static int cr_state; int app_pid; int ret, exit_status = ORTE_SUCCESS; - if( NULL == options ) { - options = OBJ_NEW(opal_crs_base_ckpt_options_t); + /* + * Clear the options set + */ + if( NULL == current_options ) { + current_options = OBJ_NEW(opal_crs_base_ckpt_options_t); } if( opal_cr_currently_stalled ) { goto STAGE_1; } + /* Default: use the fast way */ + orte_cr_continue_like_restart = false; + orte_cr_flush_restart_files = true; + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "App) notify_response: Stage 1...")); - if( ORTE_SUCCESS != (ret = app_notify_resp_stage_1(resp, options) ) ) { + if( ORTE_SUCCESS != (ret = app_notify_resp_stage_1(resp) ) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto ckpt_cleanup; } - /* - * If this is a split checkpoint operation then we only need to do stage_1, - * but we need to keep the name pipe open for the end(); - */ - if( app_split_ckpt ) { - app_notif_processed = true; - return ORTE_SUCCESS; + cr_state = OPAL_CRS_RUNNING; + current_cr_state = cr_state; + +#if OPAL_ENABLE_CRDEBUG == 1 + if( current_options->attach_debugger ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "App) notify_response: C/R Debug: Wait for debugger...")); + MPIR_debug_with_checkpoint = true; } + if( current_options->detach_debugger ) { + 
OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "App) notify_response: C/R Debug: Do not wait for debugger...")); + MPIR_debug_with_checkpoint = false; + } +#endif OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "App) notify_response: Start checkpoint...")); @@ -246,9 +435,40 @@ int snapc_full_app_notify_response(opal_cr_ckpt_cmd_state_t resp) } /* - * INC: Take the checkpoint + * If this is a quiesce_start operation then we can stop here after calling + * the INC prep. Need to keep the connection open for the quiesce_end() + * operation though. */ - ret = opal_cr_inc_core_ckpt(app_pid, local_snapshot, options, &cr_state); + if( current_options->inc_prep_only ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "App) notify_response: INC Prep Only...")); + return app_notify_resp_inc_prep_only(cr_state); + } else { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "App) notify_response: Normal operation...")); + } + + /* + * INC: Take the checkpoint + * + * If migrating, only checkpoint if you are the target process + * otherwise just continue. + */ + if( currently_all_migrating ) { + orte_cr_continue_like_restart = true; + orte_cr_flush_restart_files = false; + } + if( !currently_migrating && currently_all_migrating ) { + OPAL_OUTPUT_VERBOSE((2, mca_snapc_full_component.super.output_handle, + "App) notify_response: Skipping App. (%d) - This process is not migrating \n", + getpid())); + ret = ORTE_SUCCESS; + cr_state = OPAL_CRS_CONTINUE; + } + else { + ret = opal_cr_inc_core_ckpt(app_pid, local_snapshot, current_options, &cr_state); + } + current_cr_state = cr_state; /* * Tell Local Coordinator that we are done with local checkpoint @@ -256,7 +476,7 @@ int snapc_full_app_notify_response(opal_cr_ckpt_cmd_state_t resp) * Coordinator. 
) */ if( OPAL_CRS_RESTART != cr_state ) { - OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, "App) notify_response: Stage 2...")); if( ORTE_SUCCESS != (ret = app_notify_resp_stage_2(cr_state) ) ) { ORTE_ERROR_LOG(ret); @@ -268,10 +488,46 @@ int snapc_full_app_notify_response(opal_cr_ckpt_cmd_state_t resp) /* * INC: Recover stack using the registered coordination routine */ - if( OPAL_SUCCESS != (ret = opal_cr_inc_core_recover(cr_state)) ) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto ckpt_cleanup; + if( !currently_all_migrating ) { + if( OPAL_SUCCESS != (ret = opal_cr_inc_core_recover(cr_state)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto ckpt_cleanup; + } + } + /* + * If this is a migrating target process, then do not recover the stack, but terminate. + * All non-migrating processes will wait in the recovery until the target processes are + * restarted on the target nodes. + */ + else { + /* + * If we are one of the processes migrating, then terminate after checkpointing + */ + if( currently_migrating ) { + if( OPAL_CRS_RESTART != cr_state ) { + current_options->term = true; + } + else { + if( OPAL_SUCCESS != (ret = opal_cr_inc_core_recover(cr_state)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto ckpt_cleanup; + } + } + } + /* + * If we are not one of the processes migrating, then wait for release. + * Need to act like we are restarting during recovery, since the migrating processes + * will expect this logic. 
+ */ + else { + if( OPAL_SUCCESS != (ret = opal_cr_inc_core_recover(OPAL_CRS_RESTART)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto ckpt_cleanup; + } + } } } @@ -279,18 +535,18 @@ int snapc_full_app_notify_response(opal_cr_ckpt_cmd_state_t resp) opal_cr_stall_check = false; if(OPAL_CRS_RESTART == cr_state) { - OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "App) notify_response: Restarting...(%d)\n", - getpid())); + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) notify_response: Restarting... (%s : %d)\n", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), getpid() )); - options->term = false; + current_options->term = false; /* Do not respond to the non-existent command line tool */ goto ckpt_cleanup; } else if(cr_state == OPAL_CRS_CONTINUE) { - OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "App) notify_response: Continuing...(%d)\n", - getpid())); + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) notify_response: Continuing...(%s : %d)\n", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), getpid() )); ; /* Don't need to do anything here */ } else if(cr_state == OPAL_CRS_TERM ) { @@ -303,33 +559,40 @@ int snapc_full_app_notify_response(opal_cr_ckpt_cmd_state_t resp) } ckpt_cleanup: - - OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, "App) notify_response: Stage 3...")); - if( ORTE_SUCCESS != (ret = app_notify_resp_stage_3(cr_state) )) { + if( ORTE_SUCCESS != (ret = app_notify_resp_stage_3(cr_state, false) )) { ORTE_ERROR_LOG(ret); exit_status = ret; goto ckpt_cleanup; } - - if( options->term ) { + + if( current_options->term ) { OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "App) notify_response: User has asked to terminate the application")); - exit(ORTE_SUCCESS); + /* Wait here for termination. 
+ * If we call 'exit' then the job will fail in an ugly way, instead just + * wait for the Global coordinator to terminate us. + */ + while(1) { + opal_progress(); + sleep(1); + } } - if( NULL != options ) { - OBJ_RELEASE(options); - options = NULL; + if( NULL != current_options ) { + OBJ_RELEASE(current_options); + current_options = NULL; } + currently_checkpointing = false; + return exit_status; } -static int app_notify_resp_stage_1(opal_cr_ckpt_cmd_state_t resp, - opal_crs_base_ckpt_options_t *options) +static int app_notify_resp_stage_1(opal_cr_ckpt_cmd_state_t resp) { - int ret; + int ret, exit_status = ORTE_SUCCESS; OPAL_CR_CLEAR_TIMERS(); opal_cr_timing_my_rank = ORTE_PROC_MY_NAME->vpid; @@ -342,7 +605,8 @@ static int app_notify_resp_stage_1(opal_cr_ckpt_cmd_state_t resp, "App) notify_response: Open Communication Channels.")); if (ORTE_SUCCESS != (ret = snapc_full_app_notify_reopen_files())) { ORTE_ERROR_LOG(ret); - return ret; + exit_status = ret; + goto cleanup; } /* @@ -350,41 +614,101 @@ static int app_notify_resp_stage_1(opal_cr_ckpt_cmd_state_t resp, */ OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "App) notify_response: Initial Handshake.")); - if( ORTE_SUCCESS != (ret = snapc_full_app_ckpt_handshake_start(options, resp) ) ) { + if( ORTE_SUCCESS != (ret = snapc_full_app_ckpt_handshake_start(resp) ) ) { ORTE_ERROR_LOG(ret); - return ret; + exit_status = ret; + goto cleanup; } OPAL_CR_SET_TIMER(OPAL_CR_TIMER_ENTRY1); /* - * Begin checkpoint - * - Init the checkpoint metadata file + * Register with SStore */ OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "App) notify_response: Init checkpoint directory...")); - if( OPAL_SUCCESS != (ret = opal_crs_base_init_snapshot_directory(local_snapshot) ) ) { - opal_output(0, "App) Error: Unable to initalize the snapshot directory!\n"); + "App) notify_response: Register with SStore...")); + if( OPAL_SUCCESS != (ret = orte_sstore.register_handle(current_ss_handle)) ) 
{ ORTE_ERROR_LOG(ret); - return ret; + exit_status = ret; + goto cleanup; } + local_snapshot = OBJ_NEW(opal_crs_base_snapshot_t); + + if( !currently_migrating && currently_all_migrating ) { + orte_sstore.set_attr(current_ss_handle, + SSTORE_METADATA_LOCAL_SKIP_CKPT, + "1"); + } + + orte_sstore.get_attr(current_ss_handle, + SSTORE_METADATA_LOCAL_SNAP_LOC, + &(local_snapshot->snapshot_directory)); + orte_sstore.get_attr(current_ss_handle, + SSTORE_METADATA_LOCAL_SNAP_META, + &(local_snapshot->metadata_filename)); + OPAL_CR_SET_TIMER(OPAL_CR_TIMER_ENTRY2); OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "App) notify_response: Start checkpoint...")); + "App) notify_response: Start checkpoint... (%d)", (int)current_ss_handle)); - return ORTE_SUCCESS; + cleanup: + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "App) notify_response: Are we migrating [%5s]. Am I migrating [%5s]", + (currently_all_migrating ? "True" : "False"), + (currently_migrating ? 
"True" : "False") )); + + return exit_status; +} + +static int app_notify_resp_inc_prep_only(int cr_state) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* + * Tell the local coordinator that we are done with the INC prep + */ + if( sizeof(int) != (ret = write(app_comm_pipe_w_fd, &cr_state, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "App) notify_response: Error: Unable to write cr_state to named pipe (%s).\n", + app_comm_pipe_w); + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + app_notif_processed = true; + + cleanup: + return exit_status; } static int app_notify_resp_stage_2(int cr_state ) { int ret; + OPAL_CR_SET_TIMER(OPAL_CR_TIMER_ENTRY3); + + /* + * Sync SStore + * If we stopped the process, then we already did this + */ + if( !(current_options->stop) ) { + if( currently_migrating || !currently_all_migrating ) { + orte_sstore.set_attr(current_ss_handle, + SSTORE_METADATA_LOCAL_CRS_COMP, + local_snapshot->component_name); + } + + orte_sstore.sync(current_ss_handle); + } + last_ss_handle = current_ss_handle; + current_ss_handle = 0; + /* * Final Handshake */ - OPAL_CR_SET_TIMER(OPAL_CR_TIMER_ENTRY3); OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "App) notify_response: Waiting for final handshake.")); if( ORTE_SUCCESS != (ret = snapc_full_app_ckpt_handshake_end(cr_state ) ) ) { @@ -398,8 +722,42 @@ static int app_notify_resp_stage_2(int cr_state ) return ORTE_SUCCESS; } -static int app_notify_resp_stage_3(int cr_state) +static int app_define_pipe_names(void) { + if( NULL != app_comm_pipe_r ) { + free(app_comm_pipe_r); + app_comm_pipe_r = NULL; + } + + if( NULL != app_comm_pipe_w ) { + free(app_comm_pipe_w); + app_comm_pipe_w = NULL; + } + + asprintf(&app_comm_pipe_r, "%s/%s.%d_%d", + opal_cr_pipe_dir, OPAL_CR_NAMED_PROG_R, + (int)getpid(), current_unique_id); + asprintf(&app_comm_pipe_w, "%s/%s.%d_%d", + opal_cr_pipe_dir, OPAL_CR_NAMED_PROG_W, + (int)getpid(), current_unique_id); + + 
++current_unique_id; + + return ORTE_SUCCESS; +} + +static int app_notify_resp_stage_3(int cr_state, bool skip_fin_msg) +{ + /* + * Send a message to the local daemon letting it know that we are done + */ + if( !skip_fin_msg ) { + snapc_full_app_finished_msg(cr_state); + } + + /* + * Close and cleanup pipes + */ if( 0 <= app_comm_pipe_r_fd ) { close(app_comm_pipe_r_fd); app_comm_pipe_r_fd = -1; @@ -415,10 +773,19 @@ static int app_notify_resp_stage_3(int cr_state) app_comm_pipe_r_fd = -1; app_comm_pipe_w_fd = -1; + if( OPAL_CRS_RESTART == cr_state ) { + current_unique_id = 0; + } + + app_define_pipe_names(); + /* Prepare to wait for another checkpoint action */ opal_cr_checkpointing_state = OPAL_CR_STATUS_NONE; opal_cr_currently_stalled = false; + currently_all_migrating = false; + currently_migrating = false; + OPAL_CR_SET_TIMER(OPAL_CR_TIMER_ENTRY4); if(OPAL_CRS_RESTART != cr_state) { OPAL_CR_DISPLAY_ALL_TIMERS(); @@ -427,6 +794,37 @@ static int app_notify_resp_stage_3(int cr_state) return ORTE_SUCCESS; } +static int snapc_full_app_finished_msg(int cr_state) { + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t buffer; + orte_snapc_cmd_flag_t command = ORTE_SNAPC_LOCAL_FINISH_CMD; + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SNAPC_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &cr_state, 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, &buffer, ORTE_RML_TAG_SNAPC, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} + static int snapc_full_app_notify_reopen_files(void) { int ret = OPAL_ERR_NOT_IMPLEMENTED; @@ -492,19 +890,35 @@ static int snapc_full_app_notify_reopen_files(void) #endif /* HAVE_MKFIFO */ } -static int 
snapc_full_app_ckpt_handshake_start(opal_crs_base_ckpt_options_t *options, - opal_cr_ckpt_cmd_state_t resp) +static int snapc_full_app_ckpt_handshake_start(opal_cr_ckpt_cmd_state_t resp) { int ret, exit_status = ORTE_SUCCESS; - int len = 0, tmp_resp, opt_rep; - char *tmp_str = NULL; - ssize_t tmp_size = 0; + int tmp_resp, opt_rep; /* * Get the initial handshake command: + * - Migrating option [all, me] * - Term argument * - Stop argument */ + if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "App) notify_response: Error: Unable to read the all_migrating option from named pipe (%s). %d\n", + app_comm_pipe_r, ret); + ORTE_ERROR_LOG(ret); + goto cleanup; + } + currently_all_migrating = OPAL_INT_TO_BOOL(opt_rep); + + if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "App) notify_response: Error: Unable to read the migrating option from named pipe (%s). %d\n", + app_comm_pipe_r, ret); + ORTE_ERROR_LOG(ret); + goto cleanup; + } + currently_migrating = OPAL_INT_TO_BOOL(opt_rep); + if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &opt_rep, sizeof(int))) ) { opal_output(mca_snapc_full_component.super.output_handle, "App) notify_response: Error: Unable to read the 'term' from named pipe (%s). 
%d\n", @@ -512,7 +926,7 @@ static int snapc_full_app_ckpt_handshake_start(opal_crs_base_ckpt_options_t *opt ORTE_ERROR_LOG(ret); goto cleanup; } - options->term = OPAL_INT_TO_BOOL(opt_rep); + current_options->term = OPAL_INT_TO_BOOL(opt_rep); if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &opt_rep, sizeof(int))) ) { opal_output(mca_snapc_full_component.super.output_handle, @@ -521,8 +935,64 @@ static int snapc_full_app_ckpt_handshake_start(opal_crs_base_ckpt_options_t *opt ORTE_ERROR_LOG(ret); goto cleanup; } - options->stop = OPAL_INT_TO_BOOL(opt_rep); + current_options->stop = OPAL_INT_TO_BOOL(opt_rep); + if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "App) notify_response: Error: Unable to read the 'inc_prep_only' from named pipe (%s). %d\n", + app_comm_pipe_r, ret); + ORTE_ERROR_LOG(ret); + goto cleanup; + } + current_options->inc_prep_only = OPAL_INT_TO_BOOL(opt_rep); + + if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "App) notify_response: Error: Unable to read the 'inc_recover_only' from named pipe (%s). %d\n", + app_comm_pipe_r, ret); + ORTE_ERROR_LOG(ret); + goto cleanup; + } + current_options->inc_recover_only = OPAL_INT_TO_BOOL(opt_rep); + +#if OPAL_ENABLE_CRDEBUG == 1 + if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "App) notify_response: Error: Unable to read the 'attach_debugger' from named pipe (%s). %d\n", + app_comm_pipe_r, ret); + ORTE_ERROR_LOG(ret); + goto cleanup; + } + current_options->attach_debugger = OPAL_INT_TO_BOOL(opt_rep); + + if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "App) notify_response: Error: Unable to read the 'detach_debugger' from named pipe (%s). 
%d\n", + app_comm_pipe_r, ret); + ORTE_ERROR_LOG(ret); + goto cleanup; + } + current_options->detach_debugger = OPAL_INT_TO_BOOL(opt_rep); +#endif + + /* + * Get SStore Handle + */ + if( sizeof(orte_sstore_base_handle_t) != (ret = read(app_comm_pipe_r_fd, &current_ss_handle, sizeof(orte_sstore_base_handle_t))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "App) notify_response: Error: Unable to read the sstore handle from named pipe (%s). %d\n", + app_comm_pipe_r, ret); + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "App) %s Received Options... Responding with %d\n", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), (int)resp)); + + /* + * Write back the response to the request (message printed below) + */ tmp_resp = (int)resp; if( sizeof(int) != (ret = write(app_comm_pipe_w_fd, &tmp_resp, sizeof(int)) ) ) { opal_output(mca_snapc_full_component.super.output_handle, @@ -531,7 +1001,7 @@ static int snapc_full_app_ckpt_handshake_start(opal_crs_base_ckpt_options_t *opt ORTE_ERROR_LOG(ret); goto cleanup; } - + /* * Respond that the checkpoint is currently in progress */ @@ -572,130 +1042,20 @@ static int snapc_full_app_ckpt_handshake_start(opal_crs_base_ckpt_options_t *opt getpid())); /* - * Get Snapshot Handle argument + * Get the sentinel value indicating that we can start now + * JJH: Check for an error here indicating that even though this process is + * OK to checkpoint others might not be in which case we should cleanup + * properly. */ - if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &len, sizeof(int))) ) { + if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &opt_rep, sizeof(int))) ) { opal_output(mca_snapc_full_component.super.output_handle, - "App) notify_response: Error: Unable to read the snapshot_handle len from named pipe (%s). %d\n", + "App) notify_response: Error: Unable to read from named pipe (%s). 
%d\n", app_comm_pipe_r, ret); ORTE_ERROR_LOG(ret); goto cleanup; } - - tmp_size = sizeof(char) * len; - tmp_str = (char *) malloc(sizeof(char) * len); - if( tmp_size != (ret = read(app_comm_pipe_r_fd, tmp_str, (sizeof(char) * len))) ) { - opal_output(mca_snapc_full_component.super.output_handle, - "App) notify_response: Error: Unable to read the snapshot_handle from named pipe (%s). %d\n", - app_comm_pipe_r, ret); - ORTE_ERROR_LOG(ret); - goto cleanup; - } - - /* - * If they didn't send anything of meaning then use the defaults - */ - local_snapshot = OBJ_NEW(opal_crs_base_snapshot_t); - - if( 1 < strlen(tmp_str) ) { - if( NULL != local_snapshot->reference_name) - free( local_snapshot->reference_name ); - local_snapshot->reference_name = strdup(tmp_str); - - if( NULL != local_snapshot->local_location ) - free( local_snapshot->local_location ); - local_snapshot->local_location = opal_crs_base_get_snapshot_directory(local_snapshot->reference_name); - - if( NULL != local_snapshot->remote_location ) - free( local_snapshot->remote_location ); - local_snapshot->remote_location = strdup(local_snapshot->local_location); - } - if( NULL != tmp_str ) { - free(tmp_str); - tmp_str = NULL; - } - - /* - * Get Snapshot location argument - */ - if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &len, sizeof(int))) ) { - opal_output(mca_snapc_full_component.super.output_handle, - "App) notify_response: Error: Unable to read the snapshot_location len from named pipe (%s). %d\n", - app_comm_pipe_r, ret); - goto cleanup; - } - - tmp_str = (char *) malloc(sizeof(char) * len); - tmp_size = sizeof(char) * len; - if( tmp_size != (ret = read(app_comm_pipe_r_fd, tmp_str, (sizeof(char) * len))) ) { - opal_output(mca_snapc_full_component.super.output_handle, - "App) notify_response: Error: Unable to read the snapshot_location from named pipe (%s). 
%d\n", - app_comm_pipe_r, ret); - goto cleanup; - } - - /* - * If they didn't send anything of meaning then use the defaults - */ - if( 1 < strlen(tmp_str) ) { - if( NULL != local_snapshot->local_location) - free( local_snapshot->local_location ); - asprintf(&(local_snapshot->local_location), "%s/%s", tmp_str, local_snapshot->reference_name); - - if( NULL != local_snapshot->remote_location) - free( local_snapshot->remote_location ); - local_snapshot->remote_location = strdup(local_snapshot->local_location); - } - - if( NULL != tmp_str ) { - free(tmp_str); - tmp_str = NULL; - } - - /* - * Get Global Snapshot Ref - */ - if( sizeof(int) != (ret = read(app_comm_pipe_r_fd, &len, sizeof(int))) ) { - opal_output(mca_snapc_full_component.super.output_handle, - "App) notify_response: Error: Unable to read the global snapshot ref len from named pipe (%s). %d\n", - app_comm_pipe_r, ret); - ORTE_ERROR_LOG(ret); - goto cleanup; - } - - tmp_str = (char *) malloc(sizeof(char) * len); - tmp_size = sizeof(char) * len; - if( tmp_size != (ret = read(app_comm_pipe_r_fd, tmp_str, (sizeof(char) * len))) ) { - opal_output(mca_snapc_full_component.super.output_handle, - "App) notify_response: Error: Unable to read the global snapshot ref from named pipe (%s). %d\n", - app_comm_pipe_r, ret); - ORTE_ERROR_LOG(ret); - goto cleanup; - } - if( NULL != app_cur_global_ref ) { - free(app_cur_global_ref); - app_cur_global_ref = NULL; - } - app_cur_global_ref = strdup(tmp_str); - - /* - * Get the Seq. Number - */ - if( sizeof(size_t) != (ret = read(app_comm_pipe_r_fd, &tmp_size, sizeof(size_t))) ) { - opal_output(mca_snapc_full_component.super.output_handle, - "App) notify_response: Error: Unable to read the global snapshot seq number from named pipe (%s). 
%d\n", - app_comm_pipe_r, ret); - ORTE_ERROR_LOG(ret); - goto cleanup; - } - app_cur_epoch = (int)tmp_size; cleanup: - if( NULL != tmp_str ) { - free(tmp_str); - tmp_str = NULL; - } - return exit_status; } @@ -718,6 +1078,20 @@ static int snapc_full_app_ckpt_handshake_end(int cr_state) goto cleanup; } + if( currently_all_migrating && currently_migrating ) { + app_notify_resp_stage_3(cr_state, true); + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) handshake_end: Waiting for termination (%d)", + getpid())); + /* Wait here for termination, do not terminate ourselves. + * JJH: We cannot terminate ourselves without killing the job... + */ + while(1) { + opal_progress(); + sleep(1); + } + } + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, "App) handshake_end: Waiting for release (%d)", getpid())); @@ -750,21 +1124,41 @@ int app_coord_ft_event(int state) { /******** Checkpoint Prep ********/ if(OPAL_CRS_CHECKPOINT == state) { - ; /* Nothing */ + /* + * If stopping then sync early + */ + if( current_options->stop ) { + orte_sstore.set_attr(current_ss_handle, + SSTORE_METADATA_LOCAL_CRS_COMP, + opal_crs_base_selected_component.base_version.mca_component_name); + + orte_sstore.sync(current_ss_handle); + } } /******** Continue Recovery ********/ else if (OPAL_CRS_CONTINUE == state ) { +#if OPAL_ENABLE_CRDEBUG == 1 + /* + * Send PID to HNP/daemon if debugging as an indicator that we have + * finished the checkpoint operation. 
+ */ + if( ORTE_SUCCESS != (ret = snapc_full_app_ft_event_update_process_info(orte_process_info.my_name, getpid())) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } +#endif ; /* Nothing */ } /******** Restart Pre-Recovery ********/ else if (OPAL_CRS_RESTART_PRE == state ) { - ; + ; /* Nothing */ } /******** Restart Recovery ********/ else if (OPAL_CRS_RESTART == state ) { OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "App) Initalized for Application %s (Restart)\n", - ORTE_NAME_PRINT(ORTE_PROC_MY_NAME))); + "App) Initialized for Application %s (Restart) (%5d)\n", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), getpid())); /* * Send new PID to HNP/daemon @@ -779,6 +1173,23 @@ int app_coord_ft_event(int state) { exit_status = ret; goto cleanup; } + /* + * Since this process is interacting with an 'old' daemon, we must make + * sure to sync twice. + * JJH: This assumes that we only move whole nodes, this may be wrong + * JJH: when interacting with partial migration + */ + if( currently_all_migrating && !currently_migrating ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "App) ft_event(RESTART): Not a migrating process, so re-sync")); + orte_routed_base_register_sync(false); + } + + /* + * JJH: Optionally the non-migrating processes can wait here in stage_2 + * JJH: This will delay the initial checkpoint, but potentially speed up + * JJH: restart. 
+ */ } /******** Termination ********/ else if (OPAL_CRS_TERM == state ) { @@ -807,6 +1218,7 @@ static int snapc_full_app_ft_event_update_process_info(orte_process_name_t proc, goto cleanup; } + /* JJH CLEANUP: Do we really need this, it is equal to sender */ if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &proc, 1, ORTE_NAME))) { ORTE_ERROR_LOG(ret); exit_status = ret; @@ -819,6 +1231,14 @@ static int snapc_full_app_ft_event_update_process_info(orte_process_name_t proc, goto cleanup; } +#if OPAL_ENABLE_CRDEBUG == 1 + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &MPIR_debug_with_checkpoint, 1, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } +#endif + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, &buffer, ORTE_RML_TAG_SNAPC, 0))) { ORTE_ERROR_LOG(ret); exit_status = ret; @@ -831,22 +1251,92 @@ static int snapc_full_app_ft_event_update_process_info(orte_process_name_t proc, return exit_status; } -int app_coord_start_ckpt(orte_snapc_base_quiesce_t *datum) +int app_coord_request_op(orte_snapc_base_request_op_t *datum) { int ret, exit_status = ORTE_SUCCESS; - orte_snapc_full_cmd_flag_t command = ORTE_SNAPC_FULL_START_CKPT_CMD; + orte_snapc_full_cmd_flag_t command = ORTE_SNAPC_FULL_REQUEST_OP_CMD; opal_buffer_t buffer; + orte_std_cntr_t count; + int op_event, op_state; + char *seq_str = NULL, *tmp_str = NULL; + int cr_state = OPAL_CRS_CONTINUE; + int app_pid, i; /* - * Identify this as a split checkpoint + * Quiesce_end recovers the library before talking to the Global coord. 
*/ - app_split_ckpt = true; + if( ORTE_SNAPC_OP_QUIESCE_END == datum->event) { + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) Quiesce_end: Recovering the stack...")); + + /* + * INC: Recover the stack + */ + if( NULL == local_snapshot->component_name ) { + local_snapshot->component_name = strdup(""); + } + if( ORTE_SUCCESS != (ret = app_notify_resp_stage_2(cr_state) ) ) { + exit_status = ret; + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if(OPAL_SUCCESS != (ret = opal_cr_inc_core_recover(cr_state) ) ) { + exit_status = ret; + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = app_notify_resp_stage_3(cr_state, false) )) { + exit_status = ret; + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + currently_checkpointing = false; + app_notif_processed = false; + + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) Quiesce_end: Recovered.")); + } + else if( ORTE_SNAPC_OP_QUIESCE_CHECKPOINT == datum->event) { + app_pid = getpid(); + cr_state = OPAL_CRS_RUNNING; + if( OPAL_SUCCESS != (ret = opal_cr_inc_core_ckpt(app_pid, local_snapshot, current_options, &cr_state)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + } + + if( OPAL_CRS_RESTART != cr_state ) { + orte_sstore.sync(current_ss_handle); + } + + orte_sstore.get_attr(current_ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_SEQ, + &seq_str); + if( NULL != seq_str ) { + datum->seq_num = atoi(seq_str); + } else { + datum->seq_num = -1; + } + + orte_sstore.get_attr(current_ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_REF, + &(datum->global_handle)); + if( NULL == datum->global_handle ) { + datum->global_handle = strdup("Unknown"); + } + + return exit_status; + } /* - * Rank 0: Contact HNP to start checkpoint - * Rank *: Wait for HNP to xcast epoch + * Leader: Send the info to the head node */ - if( 0 == ORTE_PROC_MY_NAME->vpid ) { + if( datum->leader == (int)ORTE_PROC_MY_NAME->vpid ) { + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, 
+ "App) Request_op: Sending request (%3d)...", + datum->event)); /* * Send request to HNP */ @@ -856,140 +1346,308 @@ int app_coord_start_ckpt(orte_snapc_base_quiesce_t *datum) ORTE_ERROR_LOG(ret); exit_status = ret; OBJ_DESTRUCT(&buffer); - return ORTE_ERROR; + goto cleanup; } if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(ORTE_PROC_MY_NAME->jobid), 1, ORTE_JOBID))) { ORTE_ERROR_LOG(ret); exit_status = ret; OBJ_DESTRUCT(&buffer); - return ORTE_ERROR; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(datum->event), 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + + if( ORTE_SNAPC_OP_RESTART == datum->event) { + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(datum->seq_num), 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(datum->global_handle), 1, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + } + else if( ORTE_SNAPC_OP_MIGRATE == datum->event) { + /* + * Check information + * Rank | Hostname | cr_off_node | Meaning + * -------+-----------+--------------+--------- + * self | home/same | false | Do not move this process + * | | true | ERROR + * | NULL | false | Move wherever + * | | true | Move off of this node + * | other | false/true | Move to the 'other' node + * -------+-----------+--------------+--------- + * peer | home/same | false | Move 'peer' to me + * | | true | ERROR + * | NULL | false | Move wherever (Default: Move 'peer' to me) + * | | true | Move with peer to some other node + * | other | false/true | Move with peer to 'other' node + * -------+-----------+--------------+--------- + * If 'rank' is set to a peer other than self, and the peer sets + * conflicting 'hostname' or 'cr_off_node' preferences, then that + * is an error. In which case the migration should fail. 
+ */ + currently_all_migrating = true; + + /* + * Send information + */ + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(datum->mig_num), 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + + for( i = 0; i < datum->mig_num; ++i ) { + OPAL_OUTPUT_VERBOSE((30, mca_snapc_full_component.super.output_handle, + "App) Migration %3d/%3d: Sending Rank %3d - Requested <%s> (%3d) %c\n", + datum->mig_num, i, + (datum->mig_vpids)[i], + (datum->mig_host_pref)[i], + (datum->mig_vpid_pref)[i], + (OPAL_INT_TO_BOOL((datum->mig_off_node)[i]) ? 'T' : 'F') + )); + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &((datum->mig_vpids)[i]), 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + tmp_str = strdup((datum->mig_host_pref)[i]); + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &tmp_str, 1, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &((datum->mig_vpid_pref)[i]), 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &((datum->mig_off_node)[i]), 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + } } if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { ORTE_ERROR_LOG(ret); exit_status = ret; OBJ_DESTRUCT(&buffer); - return ORTE_ERROR; + goto cleanup; } OBJ_DESTRUCT(&buffer); } - while( app_cur_epoch < 0 || !app_notif_processed ) { - opal_progress(); - opal_event_loop(OPAL_EVLOOP_NONBLOCK); - OPAL_CR_TEST_CHECKPOINT_READY(); - } - - datum->epoch = app_cur_epoch; - asprintf(&(datum->handle), "[%s:%s:%d]", app_cur_global_ref, local_snapshot->reference_name, app_cur_epoch); - 
datum->target_dir = strdup(local_snapshot->local_location); - /* - * INC: Prepare the stack + * Wait for the response */ - if(OPAL_SUCCESS != (ret = opal_cr_inc_core_prep() ) ) { - ORTE_ERROR_LOG(ret); - exit_status = ret; + if( ORTE_SNAPC_OP_CHECKPOINT == datum->event) { + if( datum->leader == (int)ORTE_PROC_MY_NAME->vpid ) { + /* + * Wait for local completion (need to check to see if we are restarting) + */ + while(OPAL_CRS_CONTINUE != current_cr_state && + OPAL_CRS_RESTART != current_cr_state && + OPAL_CRS_ERROR != current_cr_state ) { + opal_progress(); + OPAL_CR_TEST_CHECKPOINT_READY(); + } + + /* Do not wait for a response if we are restarting (it will never arrive) */ + if( OPAL_CRS_RESTART == current_cr_state ) { + orte_sstore.get_attr(current_ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_SEQ, + &seq_str); + if( NULL != seq_str ) { + datum->seq_num = atoi(seq_str); + } else { + datum->seq_num = -1; + } + + orte_sstore.get_attr(current_ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_REF, + &(datum->global_handle)); + if( NULL == datum->global_handle ) { + datum->global_handle = strdup("Unknown"); + } + + current_cr_state = OPAL_CRS_NONE; + + exit_status = ORTE_SUCCESS; + goto cleanup; + } + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + /* + * Wait for a response regarding completion + */ + if (0 > (ret = orte_rml.recv_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &command, &count, ORTE_SNAPC_FULL_CMD))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &op_event, &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &op_state, &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + 
+ OBJ_DESTRUCT(&buffer); + + orte_sstore.get_attr(last_ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_SEQ, + &seq_str); + datum->seq_num = atoi(seq_str); + + orte_sstore.get_attr(last_ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_REF, + &(datum->global_handle)); + } } - - opal_cr_checkpointing_state = OPAL_CR_STATUS_RUNNING; - - return ORTE_SUCCESS; -} - -int app_coord_end_ckpt(orte_snapc_base_quiesce_t *datum) -{ - int ret, exit_status = ORTE_SUCCESS; - orte_snapc_full_cmd_flag_t command = ORTE_SNAPC_FULL_END_CKPT_CMD; - opal_buffer_t buffer; - - if( datum->restarting ) { - datum->cr_state = OPAL_CRS_RESTART; - } else { - datum->cr_state = OPAL_CRS_CONTINUE; - } - /* - * INC: Recover the stack + * Restart will terminate this process, so just wait... */ - if(OPAL_SUCCESS != (ret = opal_cr_inc_core_recover(datum->cr_state) ) ) { - ORTE_ERROR_LOG(ret); - return ret; - } - - if( datum->cr_state != OPAL_CRS_CONTINUE ) { - if( ORTE_SUCCESS != (ret = app_notify_resp_stage_3(datum->cr_state) )) { - ORTE_ERROR_LOG(ret); - return ret; + else if( ORTE_SNAPC_OP_RESTART == datum->event) { + while( 1 ) { + opal_progress(); + OPAL_CR_TEST_CHECKPOINT_READY(); + sleep(1); } - goto cleanup; } - - if( ORTE_SUCCESS != (ret = app_notify_resp_stage_2(datum->cr_state) ) ) { - ORTE_ERROR_LOG(ret); - return ret; - } - - if( ORTE_SUCCESS != (ret = app_notify_resp_stage_3(datum->cr_state) )) { - ORTE_ERROR_LOG(ret); - return ret; - } - /* - * Rank 0: Contact HNP to let them know we are done - * Then return to application + * Leader waits for response */ - if( 0 == ORTE_PROC_MY_NAME->vpid ) { - /* - * Send request to HNP - */ - OBJ_CONSTRUCT(&buffer, opal_buffer_t); + else if( ORTE_SNAPC_OP_MIGRATE == datum->event) { + if( datum->leader == (int)ORTE_PROC_MY_NAME->vpid ) { + while( currently_all_migrating ) { + opal_progress(); + OPAL_CR_TEST_CHECKPOINT_READY(); + sleep(1); + } + + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) Request_op: Leader waiting for Migrate 
release (%3d)...", + datum->event)); + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + /* + * Wait for a response regarding completion + */ + if (0 > (ret = orte_rml.recv_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + OBJ_DESTRUCT(&buffer); + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &command, &count, ORTE_SNAPC_FULL_CMD))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &op_event, &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &op_state, &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SNAPC_FULL_CMD))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; OBJ_DESTRUCT(&buffer); - return ORTE_ERROR; + + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) Request_op: Leader continuing from Migration (%3d)...", + datum->event)); + } + } + /* + * Everyone waits here for completion of Quiesce start + */ + else if( ORTE_SNAPC_OP_QUIESCE_START == datum->event) { + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) Quiesce_start: Waiting for release...")); + + while( !app_notif_processed ) { + opal_progress(); + OPAL_CR_TEST_CHECKPOINT_READY(); } - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(ORTE_PROC_MY_NAME->jobid), 1, ORTE_JOBID))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - OBJ_DESTRUCT(&buffer); - return ORTE_ERROR; - } + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) Quiesce_start: Released")); + } + /* + * No waiting for Quiesce end (barrier occurs in protocol) + */ + else if( ORTE_SNAPC_OP_QUIESCE_END == datum->event) { + OPAL_OUTPUT_VERBOSE((5, 
mca_snapc_full_component.super.output_handle, + "App) Quiesce_end: Waiting for release...")); - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(datum->epoch), 1, OPAL_INT))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - OBJ_DESTRUCT(&buffer); - return ORTE_ERROR; - } - - if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - OBJ_DESTRUCT(&buffer); - return ORTE_ERROR; - } - - OBJ_DESTRUCT(&buffer); + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "App) Quiesce_end: Released")); } - app_last_epoch = datum->epoch; - app_cur_epoch = -1; - if( NULL != app_cur_global_ref ) { - free(app_cur_global_ref); - app_cur_global_ref = NULL; - } cleanup: - /* - * Split checkpoint complete - */ - app_split_ckpt = false; - app_notif_processed = false; + if( NULL != seq_str ) { + free(seq_str); + seq_str = NULL; + } - return ORTE_SUCCESS; + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + return exit_status; } diff --git a/orte/mca/snapc/full/snapc_full_component.c b/orte/mca/snapc/full/snapc_full_component.c index cd96ab0b65..cf1653e48d 100644 --- a/orte/mca/snapc/full/snapc_full_component.c +++ b/orte/mca/snapc/full/snapc_full_component.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. 
@@ -33,9 +33,9 @@ const char *orte_snapc_full_component_version_string = static int snapc_full_open(void); static int snapc_full_close(void); -bool orte_snapc_full_skip_filem = false; bool orte_snapc_full_skip_app = false; bool orte_snapc_full_timing_enabled = false; +int orte_snapc_full_progress_meter = 0; int orte_snapc_full_max_wait_time = 20; /* @@ -107,14 +107,6 @@ static int snapc_full_open(void) mca_snapc_full_component.super.output_handle = orte_snapc_base_output; } - mca_base_param_reg_int(&mca_snapc_full_component.super.base_version, - "skip_filem", - "Not for general use! For debugging only! Pretend to move files. [Default = disabled]", - false, false, - 0, - &value); - orte_snapc_full_skip_filem = OPAL_INT_TO_BOOL(value); - mca_base_param_reg_int(&mca_snapc_full_component.super.base_version, "skip_app", "Not for general use! For debugging only! Shortcut app level coord. [Default = disabled]", @@ -138,6 +130,14 @@ static int snapc_full_open(void) 20, &orte_snapc_full_max_wait_time); + mca_base_param_reg_int(&mca_snapc_full_component.super.base_version, + "progress_meter", + "Display Progress every X percentage done. [Default = 0/off]", + false, false, + 0, + &value); + orte_snapc_full_progress_meter = (value % 101); + /* * Debug Output */ @@ -150,11 +150,11 @@ static int snapc_full_open(void) "snapc:full: open: verbosity = %d", mca_snapc_full_component.super.verbose); opal_output_verbose(20, mca_snapc_full_component.super.output_handle, - "snapc:full: open: max_wait_time = %d", + "snapc:full: open: max_wait_time = %d", orte_snapc_full_max_wait_time); opal_output_verbose(20, mca_snapc_full_component.super.output_handle, - "snapc:full: open: skip_filem = %s", - (orte_snapc_full_skip_filem == true ? 
"True" : "False")); + "snapc:full: open: progress_meter = %d", + orte_snapc_full_progress_meter); return ORTE_SUCCESS; } diff --git a/orte/mca/snapc/full/snapc_full_global.c b/orte/mca/snapc/full/snapc_full_global.c index 450fa42def..5780bb96ba 100644 --- a/orte/mca/snapc/full/snapc_full_global.c +++ b/orte/mca/snapc/full/snapc_full_global.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. @@ -25,9 +25,11 @@ #include #endif +#include "opal/include/opal/prefetch.h" #include "opal/util/output.h" #include "opal/util/opal_environ.h" #include "opal/util/basename.h" +#include "opal/util/show_help.h" #include "opal/mca/mca.h" #include "opal/mca/base/base.h" #include "opal/mca/base/mca_base_param.h" @@ -43,10 +45,10 @@ #include "orte/mca/rmaps/rmaps.h" #include "orte/mca/rmaps/rmaps_types.h" #include "orte/mca/plm/plm.h" -#include "orte/mca/filem/filem.h" #include "orte/mca/grpcomm/grpcomm.h" #include "orte/runtime/orte_wait.h" #include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/errmgr/base/base.h" #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" @@ -69,25 +71,21 @@ static orte_jobid_t current_global_jobid = ORTE_JOBID_INVALID; static orte_snapc_base_global_snapshot_t global_snapshot; +static int current_total_orteds = 0; static bool updated_job_to_running; static int current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_NONE; +static bool cleanup_on_establish = false; static bool global_coord_has_local_children = false; -static bool wait_all_xfer = false; -static opal_crs_base_ckpt_options_t *current_options = NULL; - -static double timer_start = 0; -static double timer_local_done = 0; -static double timer_xfer_done = 0; -static double timer_end = 0; -static double get_time(void); -static void print_time(void); +static bool 
currently_migrating = false; +static opal_list_t *migrating_procs = NULL; static int global_init_job_structs(void); static int global_refresh_job_structs(void); static bool snapc_orted_recv_issued = false; static bool is_orte_checkpoint_connected = false; +static bool is_app_checkpointable = false; static int snapc_full_global_start_listener(void); static int snapc_full_global_stop_listener(void); static void snapc_full_global_orted_recv(int status, @@ -97,10 +95,11 @@ static void snapc_full_global_orted_recv(int status, void* cbdata); static void snapc_full_process_orted_request_cmd(int fd, short event, void *cbdata); -static void snapc_full_process_start_ckpt_cmd(orte_process_name_t* sender, +static void snapc_full_process_restart_proc_info_cmd(orte_process_name_t* sender, + opal_buffer_t* buffer); + +static void snapc_full_process_request_op_cmd(orte_process_name_t* sender, opal_buffer_t* buffer); -static void snapc_full_process_end_ckpt_cmd(orte_process_name_t* sender, - opal_buffer_t* buffer); /*** Command Line Interactions */ static orte_process_name_t orte_checkpoint_sender = {ORTE_JOBID_INVALID, ORTE_VPID_INVALID}; @@ -114,9 +113,6 @@ static void snapc_full_global_cmdline_recv(int status, void* cbdata); static void snapc_full_process_cmdline_request_cmd(int fd, short event, void *cbdata); -static void snapc_full_process_filem_xfer(void); - - static int snapc_full_establish_snapshot_dir(bool empty_metadata); /*** */ @@ -125,14 +121,12 @@ static int snapc_full_global_notify_checkpoint(orte_jobid_t jobid, opal_crs_base_ckpt_options_t *options); static int orte_snapc_full_global_set_job_ckpt_info( orte_jobid_t jobid, int ckpt_state, - char *ckpt_snapshot_ref, - char *ckpt_snapshot_loc, + orte_sstore_base_handle_t handle, bool quick, opal_crs_base_ckpt_options_t *options); int global_coord_job_state_update(orte_jobid_t jobid, int job_ckpt_state, - char **job_ckpt_snapshot_ref, - char **job_ckpt_snapshot_loc, + orte_sstore_base_handle_t handle, 
opal_crs_base_ckpt_options_t *options); static void snapc_full_process_job_update_cmd(orte_process_name_t* sender, opal_buffer_t* buffer, @@ -141,43 +135,101 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, opal_buffer_t* buffer, bool quick); static orte_snapc_full_orted_snapshot_t *find_orted_snapshot(orte_process_name_t *name ); -static orte_snapc_base_local_snapshot_t *find_orted_app_snapshot(orte_snapc_full_orted_snapshot_t *orted_snapshot, - orte_process_name_t *name); - -static int snapc_full_start_filem(orte_snapc_full_orted_snapshot_t *orted_snapshot); -static int snapc_full_wait_filem(void); static int snapc_full_global_get_min_state(void); static int write_out_global_metadata(void); +static int orte_snapc_full_global_reset_coord(void); + +/* + * Timer stuff + */ +static void snapc_full_set_time(int idx); +static void snapc_full_display_all_timers(void); +static void snapc_full_display_recovered_timers(void); +static void snapc_full_clear_timers(void); + +static double snapc_full_get_time(void); +static void snapc_full_display_indv_timer_core(double diff, char *str); + +#define SNAPC_FULL_TIMER_START 0 +#define SNAPC_FULL_TIMER_RUNNING 1 +#define SNAPC_FULL_TIMER_FIN_LOCAL 2 +#define SNAPC_FULL_TIMER_SS_SYNC 3 +#define SNAPC_FULL_TIMER_ESTABLISH 4 +#define SNAPC_FULL_TIMER_RECOVERED 5 +#define SNAPC_FULL_TIMER_MAX 6 + +static double timer_start[SNAPC_FULL_TIMER_MAX]; + +#define SNAPC_FULL_CLEAR_TIMERS() \ + { \ + if(OPAL_UNLIKELY(orte_snapc_full_timing_enabled)) { \ + snapc_full_clear_timers(); \ + } \ + } + +#define SNAPC_FULL_SET_TIMER(idx) \ + { \ + if(OPAL_UNLIKELY(orte_snapc_full_timing_enabled)) { \ + snapc_full_set_time(idx); \ + } \ + } + +#define SNAPC_FULL_DISPLAY_ALL_TIMERS() \ + { \ + if(OPAL_UNLIKELY(orte_snapc_full_timing_enabled)) { \ + snapc_full_display_all_timers(); \ + } \ + } +#define SNAPC_FULL_DISPLAY_RECOVERED_TIMER() \ + { \ + if(OPAL_UNLIKELY(orte_snapc_full_timing_enabled)) { \ + 
snapc_full_display_recovered_timers(); \ + } \ + } + +/* + * Progress + */ +static void snapc_full_report_progress(orte_snapc_full_orted_snapshot_t *orted_snapshot, + int total, + int min_state); +static int report_progress_cur_loc_finished = 0; +static double report_progress_last_reported_loc_finished = 0; +#define SNAPC_FULL_REPORT_PROGRESS(orted, total, min_state) \ + { \ + if(OPAL_UNLIKELY(orte_snapc_full_progress_meter > 0)) { \ + snapc_full_report_progress(orted, total, min_state); \ + } \ + } + /************************ * Function Definitions ************************/ -int global_coord_init(void) { - +int global_coord_init(void) +{ current_global_jobid = ORTE_JOBID_INVALID; orte_snapc_base_snapshot_seq_number = -1; - current_options = OBJ_NEW(opal_crs_base_ckpt_options_t); + SNAPC_FULL_CLEAR_TIMERS(); return ORTE_SUCCESS; } -int global_coord_finalize(void) { - +int global_coord_finalize(void) +{ current_global_jobid = ORTE_JOBID_INVALID; orte_snapc_base_snapshot_seq_number = -1; - if( NULL != current_options ) { - OBJ_RELEASE(current_options); - current_options = NULL; - } + SNAPC_FULL_CLEAR_TIMERS(); return ORTE_SUCCESS; } int global_coord_setup_job(orte_jobid_t jobid) { int ret, exit_status = ORTE_SUCCESS; + orte_job_t *jdata = NULL; /* * Only allow one job at a time. 
@@ -193,9 +245,36 @@ int global_coord_setup_job(orte_jobid_t jobid) { OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "Global) Setup job %s as the Global Coordinator\n", ORTE_JOBID_PRINT(jobid))); + + SNAPC_FULL_CLEAR_TIMERS(); + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_START); } /* Local Coordinator pass - Always happens after global coordinator pass */ else if ( jobid == current_global_jobid ) { + + /* look up job data object */ + if (NULL == (jdata = orte_get_job_data_object(current_global_jobid))) { + ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND); + return ORTE_ERR_NOT_FOUND; + } + + if( ORTE_JOB_STATE_RESTART == jdata->state ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Global) Restarting Job %s...", + ORTE_JOBID_PRINT(jobid))); + SNAPC_FULL_CLEAR_TIMERS(); + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_START); + + if( ORTE_SUCCESS != (ret = global_refresh_job_structs()) ) { + ORTE_ERROR_LOG(ret); + return ret; + } + if( ORTE_SNAPC_LOCAL_COORD_TYPE == (orte_snapc_coord_type & ORTE_SNAPC_LOCAL_COORD_TYPE) ) { + return local_coord_setup_job(jobid); + } + return ORTE_SUCCESS; + } + /* If there are no local children, do not become a local coordinator */ if( !global_coord_has_local_children ) { return ORTE_SUCCESS; @@ -252,10 +331,11 @@ int global_coord_setup_job(orte_jobid_t jobid) { /* * If requested pre-establish the global snapshot directory */ +#if 0 if(orte_snapc_base_establish_global_snapshot_dir) { opal_output(0, "Global) Error: Pre-establishment of snapshot directory currently not supported!"); ORTE_ERROR_LOG(ORTE_ERR_NOT_SUPPORTED); -#if 0 + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "Global) Pre-establish the global snapshot directory\n")); if( ORTE_SUCCESS != (ret = snapc_full_establish_snapshot_dir(true))) { @@ -263,8 +343,8 @@ int global_coord_setup_job(orte_jobid_t jobid) { exit_status = ret; goto cleanup; } -#endif } +#endif OPAL_OUTPUT_VERBOSE((10, 
mca_snapc_full_component.super.output_handle, "Global) Finished setup of job %s ", @@ -280,6 +360,13 @@ int global_coord_release_job(orte_jobid_t jobid) { /* * Make sure we are not waiting on a checkpoint to complete */ + if( is_orte_checkpoint_connected ) { + if( ORTE_SUCCESS != (ret = orte_snapc_base_global_coord_ckpt_update_cmd(&orte_checkpoint_sender, + global_snapshot.ss_handle, + ORTE_SNAPC_CKPT_STATE_ERROR)) ) { + ORTE_ERROR_LOG(ret); + } + } /* * Clean up listeners @@ -302,26 +389,67 @@ int global_coord_release_job(orte_jobid_t jobid) { int global_coord_start_ckpt(orte_snapc_base_quiesce_t *datum) { int ret, exit_status = ORTE_SUCCESS; - - orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; - orte_snapc_base_local_snapshot_t *app_snapshot = NULL; - opal_list_item_t* orted_item = NULL; - opal_list_item_t* app_item = NULL; - orte_snapc_base_local_snapshot_t *vpid_snapshot = NULL; + orte_std_cntr_t i_proc; + orte_proc_t *proc = NULL; + orte_proc_t *new_proc = NULL; + opal_list_item_t *item = NULL; opal_crs_base_ckpt_options_t *options = NULL; + char *tmp_str = NULL; OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "Global) Starting checkpoint (internally requested)")); orte_checkpoint_sender = orte_name_invalid; - /* Save Options */ - options = OBJ_NEW(opal_crs_base_ckpt_options_t); - opal_crs_base_copy_options(options, current_options); + /* + * If migrating + */ + if( datum->migrating ) { + currently_migrating = true; + if( NULL != migrating_procs ) { + while( NULL != (item = opal_list_remove_first(migrating_procs)) ) { + proc = (orte_proc_t*)item; + OBJ_RELEASE(proc); + } + } else { + migrating_procs = OBJ_NEW(opal_list_t); + } + + /* + * Copy over the procs into a list + */ + for(i_proc = 0; i_proc < opal_pointer_array_get_size(&(datum->migrating_procs)); ++i_proc) { + proc = (orte_proc_t*)opal_pointer_array_get_item(&(datum->migrating_procs), i_proc); + if( NULL == proc ) { + continue; + } + + new_proc = OBJ_NEW(orte_proc_t); + 
new_proc->name.jobid = proc->name.jobid; + new_proc->name.vpid = proc->name.vpid; + new_proc->node = OBJ_NEW(orte_node_t); + new_proc->node->name = proc->node->name; + opal_list_append(migrating_procs, &new_proc->super); + OBJ_RETAIN(new_proc); + } + + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Global) SnapC Migrating Processes: (%d procs) [Updated]\n", + (int)opal_list_get_size(migrating_procs) )); + for (item = opal_list_get_first(migrating_procs); + item != opal_list_get_end(migrating_procs); + item = opal_list_get_next(item)) { + new_proc = (orte_proc_t*)item; + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "\t\"%s\" [%s]\n", + ORTE_NAME_PRINT(&new_proc->name),new_proc->node->name)); + } + } /************************* * Kick off the checkpoint (local coord will release the processes) *************************/ + options = OBJ_NEW(opal_crs_base_ckpt_options_t); if( ORTE_SUCCESS != (ret = snapc_full_global_checkpoint(options) ) ) { ORTE_ERROR_LOG(ret); exit_status = ret; @@ -331,8 +459,10 @@ int global_coord_start_ckpt(orte_snapc_base_quiesce_t *datum) /* * Wait for checkpoint to locally finish on all nodes */ - while(current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL && - current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_FINISHED && + while(((currently_migrating && current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_MIGRATING) || + (!currently_migrating && current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL)) && + current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_ESTABLISHED && + current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_RECOVERED && current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_ERROR && current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_NONE ) { opal_progress(); @@ -342,36 +472,26 @@ int global_coord_start_ckpt(orte_snapc_base_quiesce_t *datum) * Update the quiesce structure with the handle */ datum->snapshot = OBJ_NEW(orte_snapc_base_global_snapshot_t); - 
datum->snapshot->reference_name = strdup(global_snapshot.reference_name); - datum->snapshot->local_location = strdup(global_snapshot.local_location); - datum->snapshot->seq_num = orte_snapc_base_snapshot_seq_number; - datum->epoch = orte_snapc_base_snapshot_seq_number; - /* Copy the snapshot information */ - for(orted_item = opal_list_get_first(&(global_snapshot.local_snapshots)); - orted_item != opal_list_get_end(&(global_snapshot.local_snapshots)); - orted_item = opal_list_get_next(orted_item) ) { - orted_snapshot = (orte_snapc_full_orted_snapshot_t*)orted_item; - - if( ORTE_SNAPC_CKPT_STATE_ERROR == orted_snapshot->state ) { - continue; - } - - for(app_item = opal_list_get_first(&(orted_snapshot->super.local_snapshots)); - app_item != opal_list_get_end(&(orted_snapshot->super.local_snapshots)); - app_item = opal_list_get_next(app_item) ) { - app_snapshot = (orte_snapc_base_local_snapshot_t*)app_item; - - vpid_snapshot = OBJ_NEW(orte_snapc_base_local_snapshot_t); - vpid_snapshot->process_name.jobid = app_snapshot->process_name.jobid; - vpid_snapshot->process_name.vpid = app_snapshot->process_name.vpid; - vpid_snapshot->reference_name = strdup(app_snapshot->reference_name); - vpid_snapshot->local_location = strdup(app_snapshot->local_location); - - opal_list_append(&(datum->snapshot->local_snapshots), &(vpid_snapshot->super)); - } + datum->ss_handle = global_snapshot.ss_handle; + datum->ss_snapshot = OBJ_NEW(orte_sstore_base_global_snapshot_info_t); + if( ORTE_SUCCESS != (ret = orte_sstore.request_global_snapshot_data(&(datum->ss_handle), datum->ss_snapshot)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; } - + + /* JJH Is the snapc structure useful with the sstore structure ??? 
*/ + orte_sstore.get_attr(global_snapshot.ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_SEQ, + &tmp_str); + + if( NULL != tmp_str ) { + datum->epoch = atoi(tmp_str); + free(tmp_str); + tmp_str = NULL; + } + cleanup: if( NULL != options ) { OBJ_RELEASE(options); @@ -384,11 +504,39 @@ int global_coord_end_ckpt(orte_snapc_base_quiesce_t *datum) { int ret, exit_status = ORTE_SUCCESS; + opal_list_item_t* item = NULL; OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Global) Finishing checkpoint (internally requested)")); + "Global) Finishing checkpoint (internally requested) [%3d]", + current_job_ckpt_state)); - while(current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_FINISHED && + if( currently_migrating ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Global) End Ckpt: Flush the modex cached data\n")); + if (ORTE_SUCCESS != (ret = orte_grpcomm.purge_proc_attrs())) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + orte_grpcomm.finalize(); + if (ORTE_SUCCESS != (ret = orte_grpcomm.init())) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_ESTABLISH); + if( ORTE_SUCCESS != (ret = orte_snapc_full_global_set_job_ckpt_info(current_global_jobid, + ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL, + global_snapshot.ss_handle, + true, NULL) ) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + while(current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_RECOVERED && current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_ERROR && current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_NONE ) { opal_progress(); @@ -407,7 +555,25 @@ int global_coord_end_ckpt(orte_snapc_base_quiesce_t *datum) "Global) Finished checkpoint (internally requested) [%d]", current_job_ckpt_state)); + if( currently_migrating ) { + current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_NONE; + cleanup_on_establish = false; +
report_progress_cur_loc_finished = 0; + report_progress_last_reported_loc_finished = 0; + } + cleanup: + + currently_migrating = false; + if( NULL != migrating_procs ) { + while( NULL != (item = opal_list_remove_first(migrating_procs)) ) { + OBJ_RELEASE(item); + } + OBJ_RELEASE(migrating_procs); + migrating_procs = NULL; + } + return exit_status; } @@ -418,7 +584,7 @@ static int global_init_job_structs(void) { orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; orte_snapc_base_local_snapshot_t *app_snapshot = NULL; - orte_node_t **nodes = NULL; + orte_node_t *cur_node = NULL; orte_job_map_t *map = NULL; orte_job_t *jdata = NULL; orte_proc_t **procs = NULL; @@ -432,32 +598,34 @@ static int global_init_job_structs(void) } OBJ_CONSTRUCT(&global_snapshot, orte_snapc_base_global_snapshot_t); - /* JJH XXX global_snapshot.component_name = strdup(mca_snapc_full_component.super.base_version.mca_component_name);*/ map = jdata->map; - nodes = (orte_node_t**)map->nodes->addr; - for(i = 0; i < map->num_nodes; i++) { - procs = (orte_proc_t**)nodes[i]->procs->addr; + for (i=0; i < map->nodes->size; i++) { + if (NULL == (cur_node = (orte_node_t*)opal_pointer_array_get_item(map->nodes, i))) { + continue; + } + + procs = (orte_proc_t**)cur_node->procs->addr; OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "Global) [%d] Found Daemon %s with %d procs", - i, ORTE_NAME_PRINT(&(nodes[i]->daemon->name)), nodes[i]->num_procs)); + i, ORTE_NAME_PRINT(&(cur_node->daemon->name)), cur_node->num_procs)); orted_snapshot = OBJ_NEW(orte_snapc_full_orted_snapshot_t); - orted_snapshot->process_name.jobid = nodes[i]->daemon->name.jobid; - orted_snapshot->process_name.vpid = nodes[i]->daemon->name.vpid; + orted_snapshot->process_name.jobid = cur_node->daemon->name.jobid; + orted_snapshot->process_name.vpid = cur_node->daemon->name.vpid; if( orted_snapshot->process_name.jobid == ORTE_PROC_MY_NAME->jobid && orted_snapshot->process_name.vpid == ORTE_PROC_MY_NAME->vpid ) { 
global_coord_has_local_children = true; } - for(p = 0; p < nodes[i]->num_procs; ++p) { + for(p = 0; p < cur_node->num_procs; ++p) { OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "Global) \t [%d] Found Process %s on Daemon %s", - p, ORTE_NAME_PRINT(&(procs[p]->name)), ORTE_NAME_PRINT(&(nodes[i]->daemon->name)) )); + p, ORTE_NAME_PRINT(&(procs[p]->name)), ORTE_NAME_PRINT(&(cur_node->daemon->name)) )); app_snapshot = OBJ_NEW(orte_snapc_base_local_snapshot_t); @@ -479,10 +647,13 @@ static int global_refresh_job_structs(void) orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; orte_snapc_base_local_snapshot_t *app_snapshot = NULL; opal_list_item_t* orted_item = NULL; - orte_node_t **nodes = NULL; + opal_list_item_t* app_item = NULL; + opal_list_item_t* item = NULL; + orte_node_t *cur_node = NULL; orte_job_map_t *map = NULL; orte_job_t *jdata = NULL; orte_proc_t **procs = NULL; + orte_proc_t *new_proc = NULL; orte_std_cntr_t i = 0; orte_vpid_t p = 0; bool found = false; @@ -493,17 +664,97 @@ static int global_refresh_job_structs(void) return ORTE_ERR_NOT_FOUND; } + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Global) Refreshing Job Structures... 
[%3d]", + current_job_ckpt_state)); + + if( NULL != migrating_procs ) { + for (item = opal_list_get_first(migrating_procs); + item != opal_list_get_end(migrating_procs); + item = opal_list_get_next(item)) { + new_proc = (orte_proc_t*)item; + + /* + * Look through all daemons + */ + found = false; + for(orted_item = opal_list_get_first(&(global_snapshot.local_snapshots)); + orted_item != opal_list_get_end(&(global_snapshot.local_snapshots)); + orted_item = opal_list_get_next(orted_item) ) { + orted_snapshot = (orte_snapc_full_orted_snapshot_t*)orted_item; + + /* + * Look through all processes tracked by this daemon + */ + for(app_item = opal_list_get_first(&(orted_snapshot->super.local_snapshots)); + app_item != opal_list_get_end(&(orted_snapshot->super.local_snapshots)); + app_item = opal_list_get_next(app_item) ) { + app_snapshot = (orte_snapc_base_local_snapshot_t*)app_item; + + if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, + &(new_proc->name), + &(app_snapshot->process_name) )) { + found = true; + opal_list_remove_item(&(orted_snapshot->super.local_snapshots), app_item); + break; + } + } + + if( found ) { + break; + } + } + } + } + + /* + * First make sure that all of the orted's have the proper number of + * children, if no children, then stop tracking. 
+ */ map = jdata->map; - nodes = (orte_node_t**)map->nodes->addr; + for(orted_item = opal_list_get_first(&(global_snapshot.local_snapshots)); + orted_item != opal_list_get_end(&(global_snapshot.local_snapshots)); + orted_item = opal_list_get_next(orted_item) ) { + orted_snapshot = (orte_snapc_full_orted_snapshot_t*)orted_item; + + /* Make sure this orted is in the map */ + found = false; + for (i=0; i < map->nodes->size; i++) { + if (NULL == (cur_node = (orte_node_t*)opal_pointer_array_get_item(map->nodes, i))) { + continue; + } + + if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, + &(cur_node->daemon->name), + &(orted_snapshot->process_name) )) { + found = true; + break; + } + } + /* If not, then remove all processes, keep ref. we might reuse it later */ + if( !found ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Global) Found Empty Daemon %s not in map (Refresh)", + ORTE_NAME_PRINT(&(orted_snapshot->process_name)) )); + while( NULL != (item = opal_list_remove_first(&(orted_snapshot->super.local_snapshots))) ) { + OBJ_RELEASE(item); + } + } + } /* * Look for new nodes */ - for(i = 0; i < map->num_nodes; i++) { - procs = (orte_proc_t**)nodes[i]->procs->addr; + for (i=0; i < map->nodes->size; i++) { + if (NULL == (cur_node = (orte_node_t*)opal_pointer_array_get_item(map->nodes, i))) { + continue; + } + + procs = (orte_proc_t**)cur_node->procs->addr; /* - * See if we are already tracking it (if so skip) + * See if we are already tracking it, if so refresh it + * (This daemon could have been restarted, and processes migrated back to it) */ found = false; for(orted_item = opal_list_get_first(&(global_snapshot.local_snapshots)); @@ -512,33 +763,64 @@ static int global_refresh_job_structs(void) orted_snapshot = (orte_snapc_full_orted_snapshot_t*)orted_item; if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, - &(nodes[i]->daemon->name), + &(cur_node->daemon->name), &(orted_snapshot->process_name) )) { found = 
true; break; } } if( found ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Global) [%d] Found Daemon %s with %d procs (Refresh)", + i, ORTE_NAME_PRINT(&(cur_node->daemon->name)), cur_node->num_procs)); + + /* Remove all old processes */ + while( NULL != (item = opal_list_remove_first(&(orted_snapshot->super.local_snapshots))) ) { + OBJ_RELEASE(item); + } + + /* Add back new processes (a bit of overkill, sure, but it works) */ + for(p = 0; p < cur_node->num_procs; ++p) { + if( NULL == procs[p] ) { + continue; + } + + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Global) \t [%d] Found Process %s on Daemon %s", + p, ORTE_NAME_PRINT(&(procs[p]->name)), ORTE_NAME_PRINT(&(cur_node->daemon->name)) )); + + app_snapshot = OBJ_NEW(orte_snapc_base_local_snapshot_t); + + app_snapshot->process_name.jobid = procs[p]->name.jobid; + app_snapshot->process_name.vpid = procs[p]->name.vpid; + + opal_list_append(&(orted_snapshot->super.local_snapshots), &(app_snapshot->super)); + } + continue; } OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "Global) [%d] Found Daemon %s with %d procs", - i, ORTE_NAME_PRINT(&(nodes[i]->daemon->name)), nodes[i]->num_procs)); + i, ORTE_NAME_PRINT(&(cur_node->daemon->name)), cur_node->num_procs)); orted_snapshot = OBJ_NEW(orte_snapc_full_orted_snapshot_t); - orted_snapshot->process_name.jobid = nodes[i]->daemon->name.jobid; - orted_snapshot->process_name.vpid = nodes[i]->daemon->name.vpid; + orted_snapshot->process_name.jobid = cur_node->daemon->name.jobid; + orted_snapshot->process_name.vpid = cur_node->daemon->name.vpid; if( orted_snapshot->process_name.jobid == ORTE_PROC_MY_NAME->jobid && orted_snapshot->process_name.vpid == ORTE_PROC_MY_NAME->vpid ) { global_coord_has_local_children = true; } - for(p = 0; p < nodes[i]->num_procs; ++p) { + for(p = 0; p < cur_node->num_procs; ++p) { + if( NULL == procs[p] ) { + continue; + } + OPAL_OUTPUT_VERBOSE((10, 
mca_snapc_full_component.super.output_handle, "Global) \t [%d] Found Process %s on Daemon %s", - p, ORTE_NAME_PRINT(&(procs[p]->name)), ORTE_NAME_PRINT(&(nodes[i]->daemon->name)) )); + p, ORTE_NAME_PRINT(&(procs[p]->name)), ORTE_NAME_PRINT(&(cur_node->daemon->name)) )); app_snapshot = OBJ_NEW(orte_snapc_base_local_snapshot_t); @@ -548,7 +830,6 @@ static int global_refresh_job_structs(void) opal_list_append(&(orted_snapshot->super.local_snapshots), &(app_snapshot->super)); } - opal_list_append(&global_snapshot.local_snapshots, &(orted_snapshot->super.super)); } @@ -782,8 +1063,25 @@ static void snapc_full_process_cmdline_request_cmd(int fd, short event, void *cb goto cleanup; } - /* Save Options */ - opal_crs_base_copy_options(options, current_options); + orte_checkpoint_sender = *sender; + is_orte_checkpoint_connected = true; + + /* + * If the application is not ready for a checkpoint, + * then send back an error. + */ + if( !is_app_checkpointable ) { + OPAL_OUTPUT_VERBOSE((1, mca_snapc_full_component.super.output_handle, + "Global) request_cmd(): Checkpointing currently disabled, rejecting request")); + if( ORTE_SUCCESS != (ret = orte_snapc_base_global_coord_ckpt_update_cmd(&orte_checkpoint_sender, + 0, + ORTE_SNAPC_CKPT_STATE_ERROR))) { + ORTE_ERROR_LOG(ret); + } + orte_checkpoint_sender = orte_name_invalid; + is_orte_checkpoint_connected = false; + goto cleanup; + } /* * If the jobid was specified, and does not match the current job, then fail @@ -799,11 +1097,9 @@ static void snapc_full_process_cmdline_request_cmd(int fd, short event, void *cb /************************* * Kick off the checkpoint *************************/ - orte_checkpoint_sender = *sender; - is_orte_checkpoint_connected = true; - if(orte_snapc_full_timing_enabled) { - timer_start = get_time(); - } + SNAPC_FULL_CLEAR_TIMERS(); + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_START); + if( ORTE_SUCCESS != (ret = snapc_full_global_checkpoint(options) ) ) { ORTE_ERROR_LOG(ret); goto cleanup; @@ -887,46 
+1183,43 @@ static void snapc_full_process_orted_request_cmd(int fd, short event, void *cbda snapc_full_process_orted_update_cmd(&(mev->sender), mev->buffer, false); break; - case ORTE_SNAPC_FULL_START_CKPT_CMD: + case ORTE_SNAPC_FULL_RESTART_PROC_INFO: OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Global) Command: Start Checkpoint")); - - snapc_full_process_start_ckpt_cmd(&(mev->sender), mev->buffer); + "Global) Command: Update hostname/pid associations")); + + snapc_full_process_restart_proc_info_cmd(&(mev->sender), mev->buffer); break; - case ORTE_SNAPC_FULL_END_CKPT_CMD: + case ORTE_SNAPC_FULL_REQUEST_OP_CMD: OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Global) Command: End Checkpoint")); + "Global) Command: Request Op")); - snapc_full_process_end_ckpt_cmd(&(mev->sender), mev->buffer); + snapc_full_process_request_op_cmd(&(mev->sender), mev->buffer); break; default: ORTE_ERROR_LOG(ORTE_ERR_VALUE_OUT_OF_BOUNDS); } - /* We need to wait for the last notification to start the waiting loop - * if we do not then we could get stuck in a recursive stack. 
- */ - --num_inside; - if( wait_all_xfer && num_inside <= 0) { - wait_all_xfer = false; - snapc_full_process_filem_xfer(); - } - cleanup: /* release the message event */ OBJ_RELEASE(mev); return; } -static void snapc_full_process_start_ckpt_cmd(orte_process_name_t* sender, +static void snapc_full_process_request_op_cmd(orte_process_name_t* sender, opal_buffer_t* sbuffer) { int ret; orte_std_cntr_t count = 1; orte_jobid_t jobid; + int op_event, op_state; opal_crs_base_ckpt_options_t *options = NULL; + opal_buffer_t buffer; + orte_snapc_full_cmd_flag_t command = ORTE_SNAPC_FULL_REQUEST_OP_CMD; + int seq_num = -1, i; + char * global_handle = NULL, *tmp_str = NULL; + orte_snapc_base_request_op_t *datum = NULL; orte_checkpoint_sender = orte_name_invalid; @@ -936,50 +1229,326 @@ static void snapc_full_process_start_ckpt_cmd(orte_process_name_t* sender, goto cleanup; } - /* Save Options */ - options = OBJ_NEW(opal_crs_base_ckpt_options_t); - opal_crs_base_copy_options(options, current_options); - - /************************* - * Kick off the checkpoint (local coord will release the processes) - *************************/ - if( ORTE_SUCCESS != (ret = snapc_full_global_checkpoint(options) ) ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &op_event, &count, OPAL_INT))) { ORTE_ERROR_LOG(ret); goto cleanup; } + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Op Code %2d\n", + op_event)); + + /************************************ + * Application has been initialized and is ready for checkpointing + ************************************/ + if( ORTE_SNAPC_OP_INIT == op_event ) { + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Checkpointing Enabled (%2d)\n", + op_event)); + is_app_checkpointable = true; + } + /************************************ + * Application is finalizing and is no longer ready for checkpointing.
+ ************************************/ + else if( ORTE_SNAPC_OP_FIN == op_event ) { + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Checkpointing Disabled (%2d)\n", + op_event)); + is_app_checkpointable = false; + + /* + * Wait for any ongoing checkpoints to finish + */ + if( current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_ERROR && + current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_NONE ) { + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Wait for ongoing checkpoint to complete...")); + while( current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_ERROR && + current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_NONE ) { + opal_progress(); + } + } + + /* + * Tell the application that it is now ok to finalize + */ + OPAL_OUTPUT_VERBOSE((3, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Send Finalize ACK to the job")); + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SNAPC_FULL_CMD))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + op_event = ORTE_SNAPC_OP_FIN_ACK; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &op_event, 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(sender, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + OBJ_DESTRUCT(&buffer); + } + /************************************ + * Start a checkpoint operation + ************************************/ + else if( ORTE_SNAPC_OP_CHECKPOINT == op_event ) { + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Starting checkpoint (%2d)\n", + op_event)); + + options = OBJ_NEW(opal_crs_base_ckpt_options_t); + if( ORTE_SUCCESS != (ret = snapc_full_global_checkpoint(options) ) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + /* + * Wait for the operation to complete + */ + while(
current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_ERROR && + current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_NONE ) { + opal_progress(); + } + + if( ORTE_SNAPC_CKPT_STATE_ERROR == current_job_ckpt_state ) { + op_state = -1; + } else { + op_state = 0; + } + + /* + * Tell the sender that the operation is finished + */ + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SNAPC_FULL_CMD))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &op_event, 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &op_state, 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(sender, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + OBJ_DESTRUCT(&buffer); + } + /************************************ + * Start the Restart operation + ************************************/ + else if( ORTE_SNAPC_OP_RESTART == op_event ) { + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Starting restart (%2d)\n", + op_event)); + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &seq_num, &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &global_handle, &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + /* + * Kick off the restart + */ + if( ORTE_SUCCESS != (ret = orte_errmgr_base_restart_job(current_global_jobid, global_handle, seq_num) ) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + } + /************************************ + * Start the Migration operation + ************************************/ + else if( ORTE_SNAPC_OP_MIGRATE == op_event ) { + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Starting migration (%2d)\n", + 
op_event)); + + datum = OBJ_NEW(orte_snapc_base_request_op_t); + + /* + * Unpack migration information + */ + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &(datum->mig_num), &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + datum->mig_vpids = malloc(sizeof(int) * datum->mig_num); + datum->mig_host_pref = malloc(sizeof(char) * datum->mig_num * OPAL_MAX_PROCESSOR_NAME); + datum->mig_vpid_pref = malloc(sizeof(int) * datum->mig_num); + datum->mig_off_node = malloc(sizeof(int) * datum->mig_num); + + for( i = 0; i < datum->mig_num; ++i ) { + (datum->mig_vpids)[i] = 0; + (datum->mig_host_pref)[i][0] = '\0'; + (datum->mig_vpid_pref)[i] = 0; + (datum->mig_off_node)[i] = (int)false; + } + + for( i = 0; i < datum->mig_num; ++i ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &((datum->mig_vpids)[i]), &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if(NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &tmp_str, &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + strncpy( ((datum->mig_host_pref)[i]), tmp_str, OPAL_MAX_PROCESSOR_NAME); + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &((datum->mig_vpid_pref)[i]), &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &((datum->mig_off_node)[i]), &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, + "Global) Migration %3d/%3d: Received Rank %3d - Requested <%s> (%3d) %c\n", + datum->mig_num, i, + (datum->mig_vpids)[i], + (datum->mig_host_pref)[i], + (datum->mig_vpid_pref)[i], + (OPAL_INT_TO_BOOL((datum->mig_off_node)[i]) ? 
'T' : 'F') + )); + } + + /* + * Kick off the migration + */ + OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, + "Global) ------ Kick Off Migration -----")); + if( ORTE_SUCCESS != (ret = orte_errmgr_base_migrate_job(current_global_jobid, datum) ) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + /* + * Tell the sender that the operation is finished + */ + OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, + "Global) ------ Finished Migration. Release processes (%15s )-----", + ORTE_NAME_PRINT(sender) )); + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SNAPC_FULL_CMD))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &op_event, 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + op_state = 0; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &op_state, 1, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(sender, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + OBJ_DESTRUCT(&buffer); + + OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, + "Global) ------ Finished Migration. 
Released processes (%15s )-----", + ORTE_NAME_PRINT(sender) )); + } + /************************************ + * Start the Quiesce operation + ************************************/ + else if( ORTE_SNAPC_OP_QUIESCE_START == op_event) { + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Starting quiesce (%2d)\n", + op_event)); + + options = OBJ_NEW(opal_crs_base_ckpt_options_t); + options->inc_prep_only = true; + if( ORTE_SUCCESS != (ret = snapc_full_global_checkpoint(options) ) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + /* + * Wait for quiescence + */ + while( current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_ERROR && + current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_INC_PREPED ) { + opal_progress(); + } + + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Quiesce_start finished(%2d)\n", + op_event)); + } + /************************************ + * End the Quiesce operation + ************************************/ + else if( ORTE_SNAPC_OP_QUIESCE_END == op_event) { + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Ending quiesce (%2d)\n", + op_event)); + + /* + * Wait for the checkpoint operation to finish + */ + while( current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_ERROR && + current_job_ckpt_state != ORTE_SNAPC_CKPT_STATE_NONE ) { + opal_progress(); + } + + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Global) process_request_op(): Quiesce_end finished(%2d)\n", + op_event)); + } + + cleanup: if( NULL != options ) { OBJ_RELEASE(options); options = NULL; } - return; -} - -static void snapc_full_process_end_ckpt_cmd(orte_process_name_t* sender, - opal_buffer_t* sbuffer) -{ - int ret, exit_status = ORTE_SUCCESS; - orte_std_cntr_t count = 1; - orte_jobid_t jobid; - int local_epoch; - - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &jobid, &count, 
ORTE_JOBID))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; + if(NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; } - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(sbuffer, &local_epoch, &count, OPAL_INT))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - cleanup: return; } @@ -989,11 +1558,9 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, { int ret, exit_status = ORTE_SUCCESS; orte_std_cntr_t count; - orte_process_name_t remote_proc; - size_t num_procs, i; int remote_ckpt_state; - char *remote_ckpt_ref = NULL, *remote_ckpt_loc = NULL; - char *agent_crs = NULL; + opal_list_item_t* item = NULL; + opal_list_item_t* aitem = NULL; orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; orte_snapc_base_local_snapshot_t *app_snapshot = NULL; int loc_min_state; @@ -1018,12 +1585,9 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, * - state * Unpack the data (long) * - state - * - CRS Component * - # procs * - Foreach proc * - process name - * - ckpt_ref - * - ckpt_loc */ count = 1; if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &remote_ckpt_state, &count, OPAL_INT))) { @@ -1039,83 +1603,21 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, free(state_str); state_str = NULL; + /* JJH: Though there is currently no additional information sent in a long + * message versus a small message, keep this logic so that in the + * future it can be easily reused without substantially modifying + * the component. 
+ */ if( quick ) { exit_status = ORTE_SUCCESS; goto post_process; } - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &agent_crs, &count, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - if( NULL != orted_snapshot->opal_crs ) { - free( orted_snapshot->opal_crs ); - } - orted_snapshot->opal_crs = strdup(agent_crs); - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) CRS: %s\n", - orted_snapshot->opal_crs)); - - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &num_procs, &count, OPAL_SIZE))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - for(i = 0; i < num_procs; ++i ) { - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &remote_proc, &count, ORTE_NAME))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - app_snapshot = find_orted_app_snapshot(orted_snapshot, &remote_proc); - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Process: %s\n", - ORTE_NAME_PRINT(&remote_proc) )); - - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &remote_ckpt_ref, &count, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - if( NULL != app_snapshot->reference_name ) { - free( app_snapshot->reference_name ); - } - app_snapshot->reference_name = strdup(remote_ckpt_ref); - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Ref: %s\n", - app_snapshot->reference_name )); - - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &remote_ckpt_loc, &count, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - if( NULL != app_snapshot->remote_location ) { - free( app_snapshot->remote_location ); - } - app_snapshot->remote_location = strdup(remote_ckpt_loc); - if( NULL == app_snapshot->local_location ) { - app_snapshot->local_location = strdup(orte_snapc_base_global_snapshot_loc); - } - 
OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) R Loc: %s\n", - app_snapshot->remote_location )); - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) L Loc: %s\n", - app_snapshot->local_location )); - - } - post_process: loc_min_state = snapc_full_global_get_min_state(); + SNAPC_FULL_REPORT_PROGRESS(orted_snapshot, current_total_orteds, loc_min_state); + /* * Notify the orte-checkpoint command once we have everyone running. * No need to broadcast this to everyone since they already know. @@ -1124,10 +1626,11 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, ORTE_SNAPC_CKPT_STATE_RUNNING != current_job_ckpt_state) { current_job_ckpt_state = loc_min_state; + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_RUNNING); + if( is_orte_checkpoint_connected && ORTE_SUCCESS != (ret = orte_snapc_base_global_coord_ckpt_update_cmd(&orte_checkpoint_sender, - global_snapshot.reference_name, - global_snapshot.seq_num, + global_snapshot.ss_handle, current_job_ckpt_state)) ) { ORTE_ERROR_LOG(ret); exit_status = ret; @@ -1135,6 +1638,17 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, } } + /* + * If we are just prep'ing the INC, then acknowledge the state change + */ + if( ORTE_SNAPC_CKPT_STATE_INC_PREPED == loc_min_state && + ORTE_SNAPC_CKPT_STATE_INC_PREPED > current_job_ckpt_state) { + current_job_ckpt_state = loc_min_state; + + OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, + "Global) All Processes have finished the INC prep!\n")); + } + /* * Notify the orte-checkpoint command once we have everyone stopped. * No need to broadcast this to everyone since they already know. 
@@ -1148,8 +1662,7 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, if( is_orte_checkpoint_connected && ORTE_SUCCESS != (ret = orte_snapc_base_global_coord_ckpt_update_cmd(&orte_checkpoint_sender, - global_snapshot.reference_name, - global_snapshot.seq_num, + global_snapshot.ss_handle, current_job_ckpt_state)) ) { ORTE_ERROR_LOG(ret); exit_status = ret; @@ -1160,30 +1673,18 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, is_orte_checkpoint_connected = false; /* - * Write out metadata + * Synchronize the checkpoint here */ write_out_global_metadata(); } /* - * if(all_orted == FINISHED_LOCAL) { - * xcast(FIN_LOCAL) - * if( !xfer ) { - * xcast(FIN) -- happens in job_state_update -- - * } - * } - * if(orted == FINISHED_LOCAL && xfer) { - * start_filem_xfer(); - * send(FIN) when finished with xfer - * } + * If all daemons have finished, let everyone know we are locally finished. */ - /* - * If all daemons have finished - */ - if( loc_min_state == ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL ) { - if(orte_snapc_full_timing_enabled) { - timer_local_done = get_time(); - } + if( ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL == loc_min_state && + ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL > current_job_ckpt_state) { + + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_FIN_LOCAL); if( ORTE_SNAPC_CKPT_STATE_NONE != current_job_ckpt_state ) { if( loc_min_state == current_job_ckpt_state) { @@ -1191,14 +1692,12 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, } } - /* - * If we know that there is no file transfer, just fast path the - * finished message, the local coordinator will know how to handle it. 
- */ - if( orte_snapc_base_store_in_place || orte_snapc_full_skip_filem) { - current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_FINISHED; - } else { - current_job_ckpt_state = loc_min_state; + if( currently_migrating ) { + write_out_global_metadata(); + current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_MIGRATING; + } + else { + current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL; } if( NULL != state_str ) { @@ -1213,8 +1712,29 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, if( ORTE_SUCCESS != (ret = orte_snapc_full_global_set_job_ckpt_info(current_global_jobid, current_job_ckpt_state, - NULL, NULL, true, - NULL) ) ) { + global_snapshot.ss_handle, + true, NULL) ) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Now that we have finished locally, + * - Write out the metadata + * - Sync the snapshot to SStore + * if we are stopping then we have already written out this data. + */ + if( !(global_snapshot.options->stop) && !currently_migrating ) { + write_out_global_metadata(); + } + + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_ESTABLISH); + + if( ORTE_SUCCESS != (ret = orte_snapc_full_global_set_job_ckpt_info(current_global_jobid, + ORTE_SNAPC_CKPT_STATE_ESTABLISHED, + global_snapshot.ss_handle, + true, NULL) ) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; @@ -1222,23 +1742,85 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, } /* - * If the process has finished the local checkpoint, start any transfers - * while the other daemons are reporting in. - * - * if(orted == FINISHED_LOCAL && xfer) { - * start_filem_xfer(); - * send(FIN) when finished with xfer - * } + * If all daemons have confirmed that their local processes are finished + * and we have finished establishing the checkpoint, + * then let the command line tool know and clean up.
*/ - if( ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL == orted_snapshot->state ) { - if(!orte_snapc_base_store_in_place && !orte_snapc_full_skip_filem) { - /* Start the transfer of files while other daemons are reporting in */ - orted_snapshot->state = ORTE_SNAPC_CKPT_STATE_FILE_XFER; + if( ORTE_SNAPC_CKPT_STATE_RECOVERED == loc_min_state && + ORTE_SNAPC_CKPT_STATE_RECOVERED > current_job_ckpt_state ) { - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Starting FileM (%s)", - ORTE_NAME_PRINT(&orted_snapshot->process_name))); - if( ORTE_SUCCESS != (ret = snapc_full_start_filem(orted_snapshot) ) ) { + /* + * If this is a job restarting then we do something different + */ + if( current_job_ckpt_state == ORTE_SNAPC_CKPT_STATE_NONE ) { + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Global) Job has been successfully restarted")); + + /*current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_RECOVERED;*/ + + for(item = opal_list_get_first(&(global_snapshot.local_snapshots)); + item != opal_list_get_end(&(global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + orted_snapshot = (orte_snapc_full_orted_snapshot_t*)item; + + orted_snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE; + + for(aitem = opal_list_get_first(&(orted_snapshot->super.local_snapshots)); + aitem != opal_list_get_end(&(orted_snapshot->super.local_snapshots)); + aitem = opal_list_get_next(aitem) ) { + app_snapshot = (orte_snapc_base_local_snapshot_t*)aitem; + + app_snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE; + } + } + + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_RECOVERED); + SNAPC_FULL_DISPLAY_RECOVERED_TIMER(); + orte_snapc_base_has_recovered = true; + + exit_status = ORTE_SUCCESS; + goto cleanup; + } + + /* + * If the checkpoint has not been established yet, then do not clear the + * snapshot structure just yet. 
+ */ + if(ORTE_SNAPC_CKPT_STATE_ESTABLISHED != current_job_ckpt_state ) { + cleanup_on_establish = true; + } + + current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_RECOVERED; + + if( NULL != state_str ) { + free(state_str); + } + orte_snapc_ckpt_state_str(&state_str, current_job_ckpt_state); + OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, + "Global) Job State Changed: %d (%s)\n", + (int)current_job_ckpt_state, state_str )); + free(state_str); + state_str = NULL; + + /* + * Notify the orte-checkpoint command + */ + if( is_orte_checkpoint_connected && + ORTE_SUCCESS != (ret = orte_snapc_base_global_coord_ckpt_update_cmd(&orte_checkpoint_sender, + global_snapshot.ss_handle, + current_job_ckpt_state)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_RECOVERED); + + /* + * If the checkpoint has been established at this point, then cleanup. + */ + if( !cleanup_on_establish && ORTE_SNAPC_CKPT_STATE_RECOVERED == current_job_ckpt_state) { + if( ORTE_SUCCESS != (ret = orte_snapc_full_global_reset_coord()) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; @@ -1246,15 +1828,6 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, } } - /* - * If all of the daemons are currently transferring data, - * wait here until done. 
Then xcast(FIN) - */ - loc_min_state = snapc_full_global_get_min_state(); - if( ORTE_SNAPC_CKPT_STATE_FILE_XFER == loc_min_state ) { - wait_all_xfer = true; - } - cleanup: if( NULL != state_str ) { free(state_str); @@ -1264,45 +1837,58 @@ static int snapc_full_process_orted_update_cmd(orte_process_name_t* sender, return exit_status; } -static void snapc_full_process_filem_xfer(void) +static void snapc_full_process_restart_proc_info_cmd(orte_process_name_t* sender, + opal_buffer_t* buffer) { int ret; - char * state_str = NULL; + orte_std_cntr_t count; + size_t num_vpids = 0, i; + pid_t tmp_pid; + char * tmp_hostname = NULL; - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Wait for all FileM to complete")); - if( ORTE_SUCCESS != (ret = snapc_full_wait_filem() ) ) { - ORTE_ERROR_LOG(ret); + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &tmp_hostname, &count, OPAL_STRING))) { + opal_output(mca_snapc_full_component.super.output_handle, + "Global) vpid_assoc: Failed to unpack process Hostname from peer %s\n", + ORTE_NAME_PRINT(sender)); goto cleanup; } - if(orte_snapc_full_timing_enabled) { - timer_xfer_done = get_time(); - } - current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_FINISHED; - - orte_snapc_ckpt_state_str(&state_str, current_job_ckpt_state); - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Job State Changed: %d (%s) -- Done with Transfer of files\n", - (int)current_job_ckpt_state, state_str )); - - if( ORTE_SUCCESS != (ret = orte_snapc_full_global_set_job_ckpt_info(current_global_jobid, - current_job_ckpt_state, - NULL, NULL, true, - NULL) ) ) { - ORTE_ERROR_LOG(ret); + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &num_vpids, &count, OPAL_SIZE))) { + opal_output(mca_snapc_full_component.super.output_handle, + "Global) vpid_assoc: Failed to unpack num_vpids from peer %s\n", + ORTE_NAME_PRINT(sender)); goto cleanup; } + for(i = 0; i < num_vpids; ++i) { + count 
= 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &tmp_pid, &count, OPAL_PID))) { + opal_output(mca_snapc_full_component.super.output_handle, + "Global) vpid_assoc: Failed to unpack process PID from peer %s\n", + ORTE_NAME_PRINT(sender)); + goto cleanup; + } + + global_coord_restart_proc_info(tmp_pid, tmp_hostname); + } + + /* stdout may be buffered by the C library so it needs to be flushed so + * that the debugger can read the process info. + */ + fflush(stdout); + cleanup: - if(NULL != state_str ){ - free(state_str); - state_str = NULL; - } - return; } +int global_coord_restart_proc_info(pid_t local_pid, char * local_hostname) +{ + printf("MPIR_debug_info) %s:%d\n", local_hostname, local_pid); + return 0; +} + static void snapc_full_process_job_update_cmd(orte_process_name_t* sender, opal_buffer_t* buffer, bool quick) @@ -1311,21 +1897,21 @@ static void snapc_full_process_job_update_cmd(orte_process_name_t* sender, orte_std_cntr_t count; orte_jobid_t jobid; int job_ckpt_state = ORTE_SNAPC_CKPT_STATE_NONE; - char *job_ckpt_snapshot_ref = NULL; - char *job_ckpt_snapshot_loc = NULL; - size_t loc_seq_num = 0; opal_crs_base_ckpt_options_t *options = NULL; + bool loc_migrating = false; + size_t loc_num_procs = 0; + orte_proc_t *proc = NULL; + size_t i; + orte_sstore_base_handle_t ss_handle; /* * Unpack the data (quick) * - jobid * - ckpt_state + * - sstore_handle * Unpack the data (long) * - jobid * - ckpt_state - * - snapshot reference - * - snapshot_location - * - local seq number * - ckpt_options */ count = 1; @@ -1343,22 +1929,7 @@ static void snapc_full_process_job_update_cmd(orte_process_name_t* sender, } if( !quick ) { - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &job_ckpt_snapshot_ref, &count, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &job_ckpt_snapshot_loc, &count, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; 
- goto cleanup; - } - - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &loc_seq_num, &count, OPAL_SIZE))) { + if (ORTE_SUCCESS != (ret = orte_sstore.unpack_handle(sender, buffer, &ss_handle)) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; @@ -1373,14 +1944,39 @@ static void snapc_full_process_job_update_cmd(orte_process_name_t* sender, /* In this case we want to use the current_options that are cached * so that we do not have to send them every time. */ - opal_crs_base_copy_options(options, current_options); + opal_crs_base_copy_options(options, global_snapshot.options); + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(loc_migrating), &count, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( loc_migrating ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &loc_num_procs, &count, OPAL_SIZE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + for( i = 0; i < loc_num_procs; ++i ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &proc, &count, ORTE_NAME))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + /* JJH: Update local info as needed */ + } + } } if( ORTE_SUCCESS != (ret = global_coord_job_state_update(jobid, job_ckpt_state, - &job_ckpt_snapshot_ref, - &job_ckpt_snapshot_loc, - current_options) ) ) { + ss_handle, + global_snapshot.options) ) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; @@ -1397,38 +1993,43 @@ static void snapc_full_process_job_update_cmd(orte_process_name_t* sender, static int snapc_full_establish_snapshot_dir(bool empty_metadata) { - int ret; - char * global_snapshot_handle = NULL; + int idx = 0; + char *value = NULL; /********************* - * Generate the global snapshot directory, and unique global snapshot handle + * Contact the Stable Storage Framework to setup the storage directory *********************/ INC_SEQ_NUM(); - if( NULL == global_snapshot_handle ) { - 
orte_snapc_base_unique_global_snapshot_name(&global_snapshot_handle, getpid()); + orte_sstore.request_checkpoint_handle(&(global_snapshot.ss_handle), + orte_snapc_base_snapshot_seq_number, + current_global_jobid); + if( currently_migrating ) { + orte_sstore.set_attr(global_snapshot.ss_handle, + SSTORE_METADATA_GLOBAL_MIGRATING, + "1"); } + orte_sstore.register_handle(global_snapshot.ss_handle); - orte_snapc_base_get_global_snapshot_directory(&orte_snapc_base_global_snapshot_loc, global_snapshot_handle); - - global_snapshot.seq_num = orte_snapc_base_snapshot_seq_number; - global_snapshot.reference_name = strdup(global_snapshot_handle); - global_snapshot.local_location = opal_dirname(orte_snapc_base_global_snapshot_loc); - - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Setup Directory (seq = %d) (dir = %s)", - global_snapshot.seq_num, orte_snapc_base_global_snapshot_loc)); - - /* Creates the directory (with metadata files): - * /tmp/ompi_global_snapshot_PID.ckpt/seq_num + /* + * Save the AMCA parameter used into the metadata file */ - if( ORTE_SUCCESS != (ret = orte_snapc_base_init_global_snapshot_directory(global_snapshot.reference_name, empty_metadata))) { - ORTE_ERROR_LOG(ret); - return ret; + if( 0 > (idx = mca_base_param_find("mca", NULL, "base_param_file_prefix")) ) { + opal_show_help("help-orte-restart.txt", "amca_param_not_found", true); + } + if( 0 < idx ) { + mca_base_param_lookup_string(idx, &value); + orte_sstore.set_attr(global_snapshot.ss_handle, + SSTORE_METADATA_GLOBAL_AMCA_PARAM, + value); + + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Global) AMCA Parameter Preserved: %s", + value)); } - if( NULL != global_snapshot_handle ) { - free(global_snapshot_handle); - global_snapshot_handle = NULL; + if( NULL != value ) { + free(value); + value = NULL; } return ORTE_SUCCESS; @@ -1442,6 +2043,8 @@ static int snapc_full_global_checkpoint(opal_crs_base_ckpt_options_t *options) "Global) 
Checkpoint of job %s has been requested\n", ORTE_JOBID_PRINT(current_global_jobid))); + /* opal_output(0, "================> JJH Checkpoint Started"); */ + current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_REQUEST; /********************* @@ -1459,18 +2062,13 @@ static int snapc_full_global_checkpoint(opal_crs_base_ckpt_options_t *options) updated_job_to_running = false; if( is_orte_checkpoint_connected && ORTE_SUCCESS != (ret = orte_snapc_base_global_coord_ckpt_update_cmd(&orte_checkpoint_sender, - global_snapshot.reference_name, - global_snapshot.seq_num, - ORTE_SNAPC_CKPT_STATE_REQUEST) ) ) { + global_snapshot.ss_handle, + current_job_ckpt_state) ) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; } - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Using the checkpoint directory (%s)\n", - global_snapshot.reference_name)); - /********************** * Notify the Local Snapshot Coordinators of the checkpoint request **********************/ @@ -1493,12 +2091,15 @@ static int snapc_full_global_notify_checkpoint(orte_jobid_t jobid, int ret, exit_status = ORTE_SUCCESS; orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; opal_list_item_t* item = NULL; - char * global_dir = NULL; int ckpt_state; - orte_snapc_base_get_global_snapshot_directory(&global_dir, global_snapshot.reference_name); ckpt_state = ORTE_SNAPC_CKPT_STATE_PENDING; + /* + * Copy over the options + */ + opal_crs_base_copy_options(options, global_snapshot.options); + /* * Update the global structure */ @@ -1506,30 +2107,23 @@ static int snapc_full_global_notify_checkpoint(orte_jobid_t jobid, item != opal_list_get_end(&global_snapshot.local_snapshots); item = opal_list_get_next(item) ) { orted_snapshot = (orte_snapc_full_orted_snapshot_t*)item; - orted_snapshot->state = ckpt_state; - opal_crs_base_copy_options(options, orted_snapshot->options); + orted_snapshot->state = ckpt_state; } /* * Update the job state, and broadcast to all local daemons */ - 
orte_snapc_base_global_snapshot_loc = strdup(global_dir); if( ORTE_SUCCESS != (ret = orte_snapc_full_global_set_job_ckpt_info(jobid, ckpt_state, - global_snapshot.reference_name, - global_dir, - false, - options) ) ) { + global_snapshot.ss_handle, + false, options) ) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; } cleanup: - if( NULL != global_dir) - free(global_dir); - return exit_status; } @@ -1538,8 +2132,7 @@ static int snapc_full_global_notify_checkpoint(orte_jobid_t jobid, **********************************/ static int orte_snapc_full_global_set_job_ckpt_info( orte_jobid_t jobid, int ckpt_state, - char *ckpt_snapshot_ref, - char *ckpt_snapshot_loc, + orte_sstore_base_handle_t handle, bool quick, opal_crs_base_ckpt_options_t *options) { @@ -1547,6 +2140,9 @@ static int orte_snapc_full_global_set_job_ckpt_info( orte_jobid_t jobid, orte_snapc_full_cmd_flag_t command; opal_buffer_t buffer; char * state_str = NULL; + orte_proc_t *proc = NULL; + opal_list_item_t *item = NULL; + size_t num_procs; /* * Update all Local Coordinators (broadcast operation) @@ -1581,19 +2177,7 @@ static int orte_snapc_full_global_set_job_ckpt_info( orte_jobid_t jobid, goto process_msg; } - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &ckpt_snapshot_ref, 1, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &ckpt_snapshot_loc, 1, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &orte_snapc_base_snapshot_seq_number, 1, OPAL_SIZE))) { + if (ORTE_SUCCESS != (ret = orte_sstore.pack_handle(NULL, &buffer, handle))) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; @@ -1605,6 +2189,33 @@ static int orte_snapc_full_global_set_job_ckpt_info( orte_jobid_t jobid, goto cleanup; } + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(currently_migrating), 1, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = 
ret; + goto cleanup; + } + + if( currently_migrating ) { + num_procs = opal_list_get_size(migrating_procs); + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &num_procs, 1, OPAL_SIZE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + for (item = opal_list_get_first(migrating_procs); + item != opal_list_get_end(migrating_procs); + item = opal_list_get_next(item)) { + proc = (orte_proc_t*)item; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(proc->name), 1, ORTE_NAME))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + } + process_msg: orte_snapc_ckpt_state_str(&state_str, ckpt_state); OPAL_OUTPUT_VERBOSE((15, mca_snapc_full_component.super.output_handle, @@ -1630,21 +2241,16 @@ static int orte_snapc_full_global_set_job_ckpt_info( orte_jobid_t jobid, } OBJ_DESTRUCT(&buffer); + return exit_status; } int global_coord_job_state_update(orte_jobid_t jobid, int job_ckpt_state, - char **job_ckpt_snapshot_ref, - char **job_ckpt_snapshot_loc, + orte_sstore_base_handle_t ss_handle, opal_crs_base_ckpt_options_t *options) { int ret, exit_status = ORTE_SUCCESS; - orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; - orte_snapc_base_local_snapshot_t *app_snapshot = NULL; - opal_list_item_t* item = NULL; - opal_list_item_t* aitem = NULL; - bool term_job = false; char * state_str = NULL; orte_snapc_ckpt_state_str(&state_str, job_ckpt_state); @@ -1660,8 +2266,7 @@ int global_coord_job_state_update(orte_jobid_t jobid, current_job_ckpt_state = job_ckpt_state; if( is_orte_checkpoint_connected && ORTE_SUCCESS != (ret = orte_snapc_base_global_coord_ckpt_update_cmd(&orte_checkpoint_sender, - global_snapshot.reference_name, - global_snapshot.seq_num, + global_snapshot.ss_handle, current_job_ckpt_state)) ) { ORTE_ERROR_LOG(ret); exit_status = ret; @@ -1674,8 +2279,7 @@ int global_coord_job_state_update(orte_jobid_t jobid, if( ORTE_SNAPC_LOCAL_COORD_TYPE == (orte_snapc_coord_type & ORTE_SNAPC_LOCAL_COORD_TYPE) ) { if( ORTE_SUCCESS != 
(ret = local_coord_job_state_update(jobid, job_ckpt_state, - job_ckpt_snapshot_ref, - job_ckpt_snapshot_loc, + ss_handle, options)) ) { ORTE_ERROR_LOG(ret); exit_status = ret; @@ -1684,102 +2288,24 @@ int global_coord_job_state_update(orte_jobid_t jobid, } /* - * If we have completed locally, and not transfering files - * then just finish the checkpoint operation. - * - * Otherwise the FIN is xcast'ed in process_orted_update_cmd() + * Process the cmd */ - if( ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL == job_ckpt_state ) { - if( orte_snapc_base_store_in_place || orte_snapc_full_skip_filem) { - if( ORTE_SUCCESS != (ret = orte_snapc_full_global_set_job_ckpt_info(current_global_jobid, - ORTE_SNAPC_CKPT_STATE_FINISHED, - NULL, NULL, true, options) ) ) { + if(ORTE_SNAPC_CKPT_STATE_ESTABLISHED == job_ckpt_state ) { + /* + * If the processes recovered before the checkpoint was established, + * then we need to cleanup here instead of in the recovery block + */ + if( cleanup_on_establish ) { + if( ORTE_SUCCESS != (ret = orte_snapc_full_global_reset_coord()) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; } } } - /* - * Once finished, then cleanup and finalize the global snapshot - */ - else if( ORTE_SNAPC_CKPT_STATE_FINISHED == job_ckpt_state || - ORTE_SNAPC_CKPT_STATE_ERROR == job_ckpt_state ) { - /* - * Write out metadata - * if we are stopping then we have already written out this data. - */ - if( ! 
(current_options->stop) ) { - write_out_global_metadata(); - } - - /* - * Clear globally cached options - */ - opal_crs_base_clear_options(current_options); - - /* - * Reset global data structures - */ - for(item = opal_list_get_first(&(global_snapshot.local_snapshots)); - item != opal_list_get_end(&(global_snapshot.local_snapshots)); - item = opal_list_get_next(item) ) { - orted_snapshot = (orte_snapc_full_orted_snapshot_t*)item; - - orted_snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE; - - if( orted_snapshot->options->term ) { - term_job = true; - } - opal_crs_base_clear_options(orted_snapshot->options); - - for(aitem = opal_list_get_first(&(orted_snapshot->super.local_snapshots)); - aitem != opal_list_get_end(&(orted_snapshot->super.local_snapshots)); - aitem = opal_list_get_next(aitem) ) { - app_snapshot = (orte_snapc_base_local_snapshot_t*)aitem; - - app_snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE; - if( NULL != app_snapshot->reference_name ) { - free(app_snapshot->reference_name); - app_snapshot->reference_name = NULL; - } - if( NULL != app_snapshot->local_location ) { - free(app_snapshot->local_location); - app_snapshot->local_location = NULL; - } - if( NULL != app_snapshot->remote_location ) { - free(app_snapshot->remote_location); - app_snapshot->remote_location = NULL; - } - } - } - - if(orte_snapc_full_timing_enabled) { - timer_end = get_time(); - print_time(); - timer_start = 0; - timer_local_done = 0; - timer_xfer_done = 0; - timer_end = 0; - } - - /************************ - * Set up the Command Line listener again - *************************/ - is_orte_checkpoint_connected = false; - if( ORTE_SUCCESS != (ret = snapc_full_global_start_cmdline_listener() ) ){ - ORTE_ERROR_LOG(ret); - exit_status = ret; - } - - /******************************** - * Terminate the job if requested - * At this point the application should have already exited, but do this - * just to make doubly sure that the job is terminated. 
- *********************************/ - if( term_job ) { - orte_plm.terminate_job(jobid); - } + else if(ORTE_SNAPC_CKPT_STATE_ERROR == job_ckpt_state ) { + opal_output(mca_snapc_full_component.super.output_handle, + "Error: Checkpoint failed!"); } /* * This should not happen, since this state is always handled locally @@ -1787,14 +2313,6 @@ int global_coord_job_state_update(orte_jobid_t jobid, else if(ORTE_SNAPC_CKPT_STATE_STOPPED == job_ckpt_state ) { ; } - /* - * This should not happen, since this state is always handled locally - */ - else if( ORTE_SNAPC_CKPT_STATE_FILE_XFER == job_ckpt_state ) { - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) JJH WARNING: job state = %d (FILE_XFER)", - job_ckpt_state)); - } /* * This should not happen, since we do not handle this case */ @@ -1815,15 +2333,17 @@ int global_coord_job_state_update(orte_jobid_t jobid, static int write_out_global_metadata(void) { - int ret; orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; - orte_snapc_base_local_snapshot_t *app_snapshot = NULL; opal_list_item_t* orted_item = NULL; - opal_list_item_t* app_item = NULL; OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, "Global) Updating Metadata")); + /* + * Check for an error + * JJH CLEANUP: Check might be good, but mostly unnecessary + * JJH: Do we want to pass this along to the SStore? 
Probably + */ for(orted_item = opal_list_get_first(&(global_snapshot.local_snapshots)); orted_item != opal_list_get_end(&(global_snapshot.local_snapshots)); orted_item = opal_list_get_next(orted_item) ) { @@ -1832,41 +2352,22 @@ static int write_out_global_metadata(void) if( ORTE_SNAPC_CKPT_STATE_ERROR == orted_snapshot->state ) { return ORTE_ERROR; } - - for(app_item = opal_list_get_first(&(orted_snapshot->super.local_snapshots)); - app_item != opal_list_get_end(&(orted_snapshot->super.local_snapshots)); - app_item = opal_list_get_next(app_item) ) { - app_snapshot = (orte_snapc_base_local_snapshot_t*)app_item; - - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Process Name: %s\n", - ORTE_NAME_PRINT(&app_snapshot->process_name) )); - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Reference : %s\n", - app_snapshot->reference_name)); - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) Location : %s\n", - app_snapshot->local_location)); - - if(ORTE_SUCCESS != (ret = orte_snapc_base_add_vpid_metadata(&app_snapshot->process_name, - global_snapshot.reference_name, - app_snapshot->reference_name, - app_snapshot->local_location, - orted_snapshot->opal_crs) ) ){ - ORTE_ERROR_LOG(ret); - return ret; - } - } - } - orte_snapc_base_finalize_metadata(global_snapshot.reference_name); + /* + * Sync the stable storage + */ + orte_sstore.sync(global_snapshot.ss_handle); + + SNAPC_FULL_SET_TIMER(SNAPC_FULL_TIMER_SS_SYNC); return ORTE_SUCCESS; } static orte_snapc_full_orted_snapshot_t *find_orted_snapshot(orte_process_name_t *name ) { + int ret; + orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; opal_list_item_t* item = NULL; @@ -1881,205 +2382,32 @@ static orte_snapc_full_orted_snapshot_t *find_orted_snapshot(orte_process_name_t } } + /* + * Refresh the job structure, and try again + */ + OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, + "Global) 
find_orted(%s) failed. Refreshing and trying again...", + ORTE_NAME_PRINT(name) )); + + if( ORTE_SUCCESS != (ret = global_refresh_job_structs()) ) { + ORTE_ERROR_LOG(ret); + return NULL; + } + + for(item = opal_list_get_first(&(global_snapshot.local_snapshots)); + item != opal_list_get_end(&(global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + orted_snapshot = (orte_snapc_full_orted_snapshot_t*)item; + + if( name->jobid == orted_snapshot->process_name.jobid && + name->vpid == orted_snapshot->process_name.vpid ) { + return orted_snapshot; + } + } + return NULL; } -static orte_snapc_base_local_snapshot_t *find_orted_app_snapshot(orte_snapc_full_orted_snapshot_t *orted_snapshot, - orte_process_name_t *name) -{ - orte_snapc_base_local_snapshot_t *app_snapshot = NULL; - opal_list_item_t* item = NULL; - - for(item = opal_list_get_first(&(orted_snapshot->super.local_snapshots)); - item != opal_list_get_end(&(orted_snapshot->super.local_snapshots)); - item = opal_list_get_next(item) ) { - app_snapshot = (orte_snapc_base_local_snapshot_t*)item; - - if( name->jobid == app_snapshot->process_name.jobid && - name->vpid == app_snapshot->process_name.vpid ) { - return app_snapshot; - } - } - - return NULL; -} -static int snapc_full_start_filem(orte_snapc_full_orted_snapshot_t *orted_snapshot) -{ - int ret, exit_status = ORTE_SUCCESS; - orte_filem_base_process_set_t *p_set = NULL; - orte_filem_base_file_set_t * f_set = NULL; - opal_list_t all_filem_requests; - orte_snapc_base_local_snapshot_t *app_snapshot = NULL; - opal_list_item_t* item = NULL; - - OBJ_CONSTRUCT(&all_filem_requests, opal_list_t); - - /* - * If we just want to pretend to do the filem - */ - if(orte_snapc_full_skip_filem) { - exit_status = ORTE_SUCCESS; - goto cleanup; - } - /* - * If it is stored in place, then we do not need to transfer anything - * -- Should not have gotten here, so return an error -- - */ - else if( orte_snapc_base_store_in_place ) { - exit_status = ORTE_ERROR; - goto 
cleanup; - } - - /* - * Setup the FileM data structures to transfer the files - */ - orted_snapshot->filem_request = OBJ_NEW(orte_filem_base_request_t); - /* - * Construct the process set - */ - p_set = OBJ_NEW(orte_filem_base_process_set_t); - - p_set->source.jobid = orted_snapshot->process_name.jobid; - p_set->source.vpid = orted_snapshot->process_name.vpid; - p_set->sink.jobid = ORTE_PROC_MY_NAME->jobid; - p_set->sink.vpid = ORTE_PROC_MY_NAME->vpid; - - opal_list_append(&(orted_snapshot->filem_request->process_sets), &(p_set->super) ); - - for(item = opal_list_get_first(&(orted_snapshot->super.local_snapshots)); - item != opal_list_get_end(&(orted_snapshot->super.local_snapshots)); - item = opal_list_get_next(item) ) { - app_snapshot = (orte_snapc_base_local_snapshot_t*)item; - - /* If one of the checkpoints failed, we need to return an error */ - if( ORTE_SNAPC_CKPT_STATE_ERROR == app_snapshot->state ) { - exit_status = ORTE_ERROR; - ORTE_ERROR_LOG(ORTE_ERROR); - goto cleanup; - } - - /* - * Construct the file set - */ - f_set = OBJ_NEW(orte_filem_base_file_set_t); - - f_set->local_target = strdup(orte_snapc_base_global_snapshot_loc); - if( orte_snapc_base_is_global_dir_shared ) { - f_set->local_hint = ORTE_FILEM_HINT_SHARED; - } - - asprintf(&(f_set->remote_target), "%s/%s", app_snapshot->remote_location, app_snapshot->reference_name); - - f_set->target_flag = ORTE_FILEM_TYPE_DIR; - - opal_list_append(&(orted_snapshot->filem_request->file_sets), &(f_set->super) ); - - OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Global) ... 
FileM (%s) [%s] --> [%s]", - ORTE_NAME_PRINT(&orted_snapshot->process_name), f_set->remote_target, f_set->local_target)); - } - - /* - * Start the transfer - */ - if(ORTE_SUCCESS != (ret = orte_filem.get_nb(orted_snapshot->filem_request) ) ) { - OBJ_RELEASE(orted_snapshot->filem_request); - orted_snapshot->filem_request = NULL; - exit_status = ret; - ORTE_ERROR_LOG(ret); - goto cleanup; - } - - cleanup: - return exit_status; -} - -static int snapc_full_wait_filem(void) -{ - int ret, exit_status = ORTE_SUCCESS; - opal_list_t all_filem_requests; - orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; - opal_list_item_t* item = NULL; - - OBJ_CONSTRUCT(&all_filem_requests, opal_list_t); - - /* - * Construct a list for wait_all() - */ - for(item = opal_list_get_first(&(global_snapshot.local_snapshots)); - item != opal_list_get_end(&(global_snapshot.local_snapshots)); - item = opal_list_get_next(item) ) { - orted_snapshot = (orte_snapc_full_orted_snapshot_t*)item; - - if( NULL != orted_snapshot->filem_request ) { - opal_list_append(&all_filem_requests, &(orted_snapshot->filem_request->super)); - } - } - - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) FileM -- Enter wait_all() Get")); - - /* - * Wait for all transfers to complete - */ - if(ORTE_SUCCESS != (ret = orte_filem.wait_all(&all_filem_requests) ) ) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) FileM -- Setup removal()")); - - /* - * Start removal of old data - */ - for(item = opal_list_get_first(&(global_snapshot.local_snapshots)); - item != opal_list_get_end(&(global_snapshot.local_snapshots)); - item = opal_list_get_next(item) ) { - orted_snapshot = (orte_snapc_full_orted_snapshot_t*)item; - - if( NULL != orted_snapshot->filem_request ) { - if(ORTE_SUCCESS != (ret = orte_filem.rm_nb(orted_snapshot->filem_request)) ) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto 
cleanup; - } - } - } - - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Global) FileM -- Enter wait_all() Remove")); - - /* - * Wait for all removals to complete - */ - if(ORTE_SUCCESS != (ret = orte_filem.wait_all(&all_filem_requests) ) ) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - cleanup: - for(item = opal_list_get_first(&(global_snapshot.local_snapshots)); - item != opal_list_get_end(&(global_snapshot.local_snapshots)); - item = opal_list_get_next(item) ) { - orted_snapshot = (orte_snapc_full_orted_snapshot_t*)item; - - if( NULL != orted_snapshot->filem_request ) { - /*OBJ_RELEASE(orted_snapshot->filem_request);*/ - orted_snapshot->filem_request = NULL; - } - } - - /* JJH I don't think this is needed (??) */ - while (NULL != (item = opal_list_remove_first(&all_filem_requests) ) ) { - OBJ_RELEASE(item); - } - OBJ_DESTRUCT(&all_filem_requests); - return exit_status; -} - static int snapc_full_global_get_min_state(void) { int min_state = ORTE_SNAPC_CKPT_MAX; @@ -2088,6 +2416,8 @@ static int snapc_full_global_get_min_state(void) char * state_str_a = NULL; char * state_str_b = NULL; + current_total_orteds = 0; + for(item = opal_list_get_first(&(global_snapshot.local_snapshots)); item != opal_list_get_end(&(global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { @@ -2096,11 +2426,13 @@ static int snapc_full_global_get_min_state(void) /* Ignore orteds with no processes */ if( 0 >= opal_list_get_size(&(orted_snapshot->super.local_snapshots)) ) { OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Global) ... Skipping - %s (no children)", + "Global) ... 
%s Skipping - (no children)", ORTE_NAME_PRINT(&orted_snapshot->process_name) )); continue; } + current_total_orteds++; + if( NULL != state_str_a ) { free(state_str_a); state_str_a = NULL; @@ -2114,7 +2446,8 @@ static int snapc_full_global_get_min_state(void) orte_snapc_ckpt_state_str(&state_str_b, min_state); OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Global) ... Checking [%d %s] vs [%d %s]", + "Global) ... %s Checking [%d %s] vs [%d %s]", + ORTE_NAME_PRINT(&orted_snapshot->process_name), (int)orted_snapshot->state, state_str_a, min_state, state_str_b )); @@ -2122,7 +2455,8 @@ static int snapc_full_global_get_min_state(void) min_state = orted_snapshot->state; OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Global) ... Update --> Min State [%d %s]", + "Global) ... %s Update --> Min State [%d %s]", + ORTE_NAME_PRINT(&orted_snapshot->process_name), (int)min_state, state_str_a )); } } @@ -2148,7 +2482,167 @@ static int snapc_full_global_get_min_state(void) return min_state; } -static double get_time(void) { +static int orte_snapc_full_global_reset_coord(void) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_list_item_t* item = NULL; + opal_list_item_t* aitem = NULL; + orte_snapc_full_orted_snapshot_t *orted_snapshot = NULL; + orte_snapc_base_local_snapshot_t *app_snapshot = NULL; + + + /******************************** + * Terminate the job if requested + * At this point the application should have already exited, but do this + * just to make doubly sure that the job is terminated. 
+ *********************************/ + if( global_snapshot.options->term ) { + SNAPC_FULL_DISPLAY_ALL_TIMERS(); + orte_plm.terminate_job(current_global_jobid); + } else { + SNAPC_FULL_DISPLAY_ALL_TIMERS(); + } + + /* + * Just cleanup, do not need to send out another message + */ + opal_crs_base_clear_options(global_snapshot.options); + + /* + * Reset global data structures + */ + for(item = opal_list_get_first(&(global_snapshot.local_snapshots)); + item != opal_list_get_end(&(global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + orted_snapshot = (orte_snapc_full_orted_snapshot_t*)item; + + orted_snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE; + + for(aitem = opal_list_get_first(&(orted_snapshot->super.local_snapshots)); + aitem != opal_list_get_end(&(orted_snapshot->super.local_snapshots)); + aitem = opal_list_get_next(aitem) ) { + app_snapshot = (orte_snapc_base_local_snapshot_t*)aitem; + + app_snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE; + } + } + + /************************ + * Set up the Command Line listener again + *************************/ + is_orte_checkpoint_connected = false; + if( ORTE_SUCCESS != (ret = snapc_full_global_start_cmdline_listener() ) ){ + ORTE_ERROR_LOG(ret); + exit_status = ret; + } + + current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_NONE; + cleanup_on_establish = false; + + report_progress_cur_loc_finished = 0; + report_progress_last_reported_loc_finished = 0; + + return exit_status; +} + +/************************ + * Timing + ************************/ +static void snapc_full_set_time(int idx) +{ + if(idx < SNAPC_FULL_TIMER_MAX ) { + if( timer_start[idx] <= 0.0 ) { + timer_start[idx] = snapc_full_get_time(); + } + } +} + +static void snapc_full_display_all_timers(void) +{ + double diff = 0.0; + char * label = NULL; + + opal_output(0, "Snapshot Coordination Timing: ******************** Summary Begin\n"); + + /********** Startup time **********/ + label = strdup("Running"); + diff = timer_start[SNAPC_FULL_TIMER_RUNNING] 
- timer_start[SNAPC_FULL_TIMER_START]; + snapc_full_display_indv_timer_core(diff, label); + free(label); + + /********** Time to finish locally **********/ + label = strdup("Finish Locally"); + diff = timer_start[SNAPC_FULL_TIMER_FIN_LOCAL] - timer_start[SNAPC_FULL_TIMER_RUNNING]; + snapc_full_display_indv_timer_core(diff, label); + free(label); + + if( timer_start[SNAPC_FULL_TIMER_SS_SYNC] <= timer_start[SNAPC_FULL_TIMER_RECOVERED] ) { + /********** SStore Sync **********/ + label = strdup("SStore Sync"); + diff = timer_start[SNAPC_FULL_TIMER_SS_SYNC] - timer_start[SNAPC_FULL_TIMER_FIN_LOCAL]; + snapc_full_display_indv_timer_core(diff, label); + free(label); + + /********** Establish Ckpt **********/ + label = strdup("Establish"); + diff = timer_start[SNAPC_FULL_TIMER_ESTABLISH] - timer_start[SNAPC_FULL_TIMER_SS_SYNC]; + snapc_full_display_indv_timer_core(diff, label); + free(label); + + /********** Recover **********/ + label = strdup("Continue/Recover"); + diff = timer_start[SNAPC_FULL_TIMER_RECOVERED] - timer_start[SNAPC_FULL_TIMER_ESTABLISH]; + snapc_full_display_indv_timer_core(diff, label); + free(label); + } else { /* Established after procs recovered */ + /********** SStore Sync **********/ + label = strdup("SStore Sync*"); + diff = timer_start[SNAPC_FULL_TIMER_SS_SYNC] - timer_start[SNAPC_FULL_TIMER_RECOVERED]; + snapc_full_display_indv_timer_core(diff, label); + free(label); + + /********** Establish Ckpt **********/ + label = strdup("Establish*"); + diff = timer_start[SNAPC_FULL_TIMER_ESTABLISH] - timer_start[SNAPC_FULL_TIMER_SS_SYNC]; + snapc_full_display_indv_timer_core(diff, label); + free(label); + + /********** Recover **********/ + label = strdup("Continue/Recover*"); + diff = timer_start[SNAPC_FULL_TIMER_RECOVERED] - timer_start[SNAPC_FULL_TIMER_FIN_LOCAL]; + snapc_full_display_indv_timer_core(diff, label); + free(label); + } + + opal_output(0, "Snapshot Coordination Timing: ******************** Summary End\n"); +} + +static void 
snapc_full_display_recovered_timers(void) +{ + double diff = 0.0; + char * label = NULL; + + opal_output(0, "Snapshot Coordination Timing: ******************** Summary Begin\n"); + + /********** Recover **********/ + label = strdup("Recover"); + diff = timer_start[SNAPC_FULL_TIMER_RECOVERED] - timer_start[SNAPC_FULL_TIMER_START]; + snapc_full_display_indv_timer_core(diff, label); + free(label); + + opal_output(0, "Snapshot Coordination Timing: ******************** Summary End\n"); +} + +static void snapc_full_clear_timers(void) +{ + int i; + for(i = 0; i < SNAPC_FULL_TIMER_MAX; ++i) { + timer_start[i] = 0.0; + } +} + +static double snapc_full_get_time(void) +{ double wtime; #if OPAL_TIMER_USEC_NATIVE @@ -2163,30 +2657,62 @@ static double get_time(void) { return wtime; } -static void print_time(void) { - double t_local, t_transfer, t_cleanup, t_total; +static void snapc_full_display_indv_timer_core(double diff, char *str) +{ + double total = 0; + double perc = 0; - if(!orte_snapc_full_timing_enabled) { + if( timer_start[SNAPC_FULL_TIMER_SS_SYNC] <= timer_start[SNAPC_FULL_TIMER_RECOVERED] ) { + total = timer_start[SNAPC_FULL_TIMER_RECOVERED] - timer_start[SNAPC_FULL_TIMER_START]; + } else { + total = timer_start[SNAPC_FULL_TIMER_ESTABLISH] - timer_start[SNAPC_FULL_TIMER_START]; + } + perc = (diff/total) * 100; + + opal_output(0, + "snapc_full: timing: %-20s = %10.2f s\t%10.2f s\t%6.2f\n", + str, + diff, + total, + perc); + return; +} + +static void snapc_full_report_progress(orte_snapc_full_orted_snapshot_t *orted_snapshot, int total, int min_state) +{ + orte_snapc_full_orted_snapshot_t *loc_orted_snapshot = NULL; + opal_list_item_t* item = NULL; + double perc_done; + + if( ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL != orted_snapshot->state ) { return; } - t_total = timer_end - timer_start; + report_progress_cur_loc_finished++; + perc_done = (total-report_progress_cur_loc_finished)/(total*1.0); + perc_done = (perc_done-1)*(-100.0); - t_local = timer_local_done - 
timer_start; - - if(orte_snapc_base_store_in_place || orte_snapc_full_skip_filem) { - t_transfer = 0; - t_cleanup = timer_end - timer_local_done; - } else { - t_transfer = timer_xfer_done - timer_local_done; - t_cleanup = timer_end - timer_xfer_done; + if( perc_done >= (report_progress_last_reported_loc_finished + orte_snapc_full_progress_meter) || + report_progress_last_reported_loc_finished == 0.0 ) { + report_progress_last_reported_loc_finished = perc_done; + opal_output(0, "snapc_full: progress: %10.2f %c Locally Finished\n", + perc_done, '%'); } - opal_output(0, "Checkpoint Time:"); - opal_output(0, "\tLocal : %10.2f s\n", t_local); - opal_output(0, "\tTransfer: %10.2f s\n", t_transfer); - opal_output(0, "\tCleanup : %10.2f s\n", t_cleanup); - opal_output(0, "\tTotal : %10.2f s\n", t_total); + if( perc_done > 95.0 ) { + opal_output(0, "snapc_full: progress: Waiting on the following daemons (%10.2f %c):", perc_done, '%'); + + for(item = opal_list_get_first(&(global_snapshot.local_snapshots)); + item != opal_list_get_end(&(global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + loc_orted_snapshot = (orte_snapc_full_orted_snapshot_t*)item; + + if( ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL != loc_orted_snapshot->state ) { + opal_output(0, "snapc_full: progress: Daemon %s", + ORTE_NAME_PRINT(&loc_orted_snapshot->process_name)); + } + } + } return; } diff --git a/orte/mca/snapc/full/snapc_full_local.c b/orte/mca/snapc/full/snapc_full_local.c index e8ff1363cd..3be4d49f60 100644 --- a/orte/mca/snapc/full/snapc_full_local.c +++ b/orte/mca/snapc/full/snapc_full_local.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2009 The Trustees of Indiana University. + * Copyright (c) 2004-2010 The Trustees of Indiana University. * All rights reserved. * Copyright (c) 2004-2005 The Trustees of the University of Tennessee. * All rights reserved. 
@@ -62,6 +62,8 @@ #include "orte/mca/odls/odls.h" #include "orte/mca/odls/base/odls_private.h" #include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/routed/routed.h" +#include "orte/mca/grpcomm/grpcomm.h" #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" @@ -72,11 +74,16 @@ * Locally Global vars & functions :) ************************************/ static orte_jobid_t current_local_jobid = ORTE_JOBID_INVALID; -static opal_list_t snapc_local_vpids; +static orte_snapc_base_global_snapshot_t local_global_snapshot; + static int current_job_ckpt_state = ORTE_SNAPC_CKPT_STATE_NONE; -static opal_crs_base_ckpt_options_t *current_local_options = NULL; -static char * global_ckpt_ref = NULL; +static bool currently_migrating = false; +static bool flushed_modex = false; +static bool sstore_local_sync_finished = false; +static bool sstore_local_procs_finished = false; + +static int local_define_pipe_names(orte_snapc_full_app_snapshot_t *vpid_snapshot); static bool snapc_local_hnp_recv_issued = false; static int snapc_full_local_start_hnp_listener(void); @@ -100,14 +107,19 @@ static void snapc_full_local_process_app_update_cmd(int fd, short event, void *c static orte_snapc_full_app_snapshot_t *find_vpid_snapshot(orte_process_name_t *name ); static int snapc_full_local_get_vpids(void); +static int snapc_full_local_refresh_vpids(void); + +#if OPAL_ENABLE_CRDEBUG == 1 +static int snapc_full_local_send_restart_proc_info(void); +#endif static void snapc_full_local_process_job_update_cmd(orte_process_name_t* sender, opal_buffer_t* buffer, bool quick); static int local_coord_job_state_update_finished_local(void); +static int local_coord_job_state_update_finished_local_vpid(orte_snapc_full_app_snapshot_t *vpid_snapshot); -static int snapc_full_local_setup_snapshot_dir(char * snapshot_ref, char * sugg_dir, char **actual_dir); #if 0 static int snapc_full_establish_dir(void); #endif @@ -124,6 +136,7 @@ static int 
snapc_full_local_start_ckpt_handshake(orte_snapc_full_app_snapshot_t static int snapc_full_local_end_ckpt_handshake(orte_snapc_full_app_snapshot_t *vpid_snapshot); static void snapc_full_local_comm_read_event(int fd, short flags, void *arg); +static int orte_snapc_full_local_reset_coord(void); /************************ * Function Definitions @@ -150,16 +163,56 @@ int local_coord_finalize( void ) int local_coord_setup_job(orte_jobid_t jobid) { int ret, exit_status = ORTE_SUCCESS; - - current_local_options = OBJ_NEW(opal_crs_base_ckpt_options_t); + orte_snapc_full_app_snapshot_t *vpid_snapshot = NULL; + opal_list_item_t* item = NULL; /* * Set the jobid that we are responsible for */ if( jobid == current_local_jobid ) { - OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Local) Already setup job %s.", - ORTE_JOBID_PRINT(jobid) )); + /* If we pass this way twice, we must be restarting. + * so just refresh the vpid structure + */ + if( currently_migrating ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Restarting Job %s from Migration...", + ORTE_JOBID_PRINT(jobid))); + if( ORTE_SUCCESS != (ret = snapc_full_local_refresh_vpids() ) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + else { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Restarting Job %s...", + ORTE_JOBID_PRINT(jobid))); + + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; + opal_list_remove_item(&(local_global_snapshot.local_snapshots), item); + } + + if( ORTE_SUCCESS != (ret = snapc_full_local_get_vpids() ) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != 
opal_list_get_end(&(local_global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Restarting Job %s: Daemon %s \t Process %s", + ORTE_JOBID_PRINT(jobid), + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + ORTE_NAME_PRINT(&vpid_snapshot->super.process_name))); + } + } + exit_status = ORTE_SUCCESS; goto cleanup; } @@ -179,7 +232,8 @@ int local_coord_setup_job(orte_jobid_t jobid) /* * Get the list of vpid's that we care about */ - OBJ_CONSTRUCT(&snapc_local_vpids, opal_list_t); + OBJ_CONSTRUCT(&local_global_snapshot, orte_snapc_base_global_snapshot_t); + if( ORTE_SUCCESS != (ret = snapc_full_local_get_vpids()) ) { ORTE_ERROR_LOG(ret); exit_status = ret; @@ -241,19 +295,20 @@ int local_coord_release_job(orte_jobid_t jobid) do { is_done = true; - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; - if(ORTE_SNAPC_CKPT_STATE_NONE != vpid_snapshot->super.state && - ORTE_SNAPC_CKPT_STATE_ERROR != vpid_snapshot->super.state && - ORTE_SNAPC_CKPT_STATE_FINISHED != vpid_snapshot->super.state ) { + if(ORTE_SNAPC_CKPT_STATE_NONE != vpid_snapshot->super.state && + ORTE_SNAPC_CKPT_STATE_ERROR != vpid_snapshot->super.state && + ORTE_SNAPC_CKPT_STATE_ESTABLISHED != vpid_snapshot->super.state && + ORTE_SNAPC_CKPT_STATE_RECOVERED != vpid_snapshot->super.state ) { is_done = false; break; } else { - opal_list_remove_item(&snapc_local_vpids, item); + opal_list_remove_item(&(local_global_snapshot.local_snapshots), item); } } if( !is_done ) { @@ -261,7 +316,7 @@ int local_coord_release_job(orte_jobid_t jobid) } } while(!is_done); - OBJ_DESTRUCT(&snapc_local_vpids); 
+ OBJ_DESTRUCT(&local_global_snapshot); /* * Stop Global Coordinator listeners @@ -276,11 +331,6 @@ int local_coord_release_job(orte_jobid_t jobid) exit_status = ret; } - if( NULL != current_local_options ) { - OBJ_RELEASE(current_local_options); - current_local_options = NULL; - } - return exit_status; } @@ -430,12 +480,6 @@ void snapc_full_local_app_cmd_recv(int status, return; } - /* - * This is the local process contacting us with its updated pid information - */ - OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Local) Application: Update pid operation.")); - /* * Do not handle here, use the event engine to queue this until we are out * of the RML @@ -488,11 +532,21 @@ static void snapc_full_local_process_app_update_cmd(int fd, short event, void *c { int ret; orte_message_event_t *mev = (orte_message_event_t*)cbdata; + opal_list_item_t* item = NULL; orte_snapc_full_app_snapshot_t *vpid_snapshot = NULL; orte_snapc_cmd_flag_t command; orte_process_name_t proc; pid_t proc_pid = 0; orte_std_cntr_t count; + int cr_state; + orte_process_name_t *sender = NULL; + bool is_done; +#if OPAL_ENABLE_CRDEBUG == 1 + bool all_done = false; + bool crdebug_enabled = false; +#endif + + sender = &(mev->sender); /* * Verify the command @@ -503,47 +557,247 @@ static void snapc_full_local_process_app_update_cmd(int fd, short event, void *c goto cleanup; } - if( ORTE_SNAPC_LOCAL_UPDATE_CMD != command ) { + if( ORTE_SNAPC_LOCAL_UPDATE_CMD != command && + ORTE_SNAPC_LOCAL_FINISH_CMD != command ) { OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "Local) Warning: Expected an application command (%d) but received (%d)\n", ORTE_SNAPC_LOCAL_UPDATE_CMD, command)); goto cleanup; } - /* - * Unpack the data - * - process name - * - PID - */ - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &proc, &count, ORTE_NAME))) { - ORTE_ERROR_LOG(ret); - goto cleanup; + if( ORTE_SNAPC_LOCAL_UPDATE_CMD == command ) { + /* + * This is the 
local process contacting us with its updated pid information + */ + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Application: Update pid operation.")); + + /* + * Unpack the data + * - process name + * - PID + */ + count = 1; + /* JJH CLEANUP: Do we really need this, it is equal to sender */ + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &proc, &count, ORTE_NAME))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &proc_pid, &count, OPAL_PID))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } +#if OPAL_ENABLE_CRDEBUG == 1 + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &crdebug_enabled, &count, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } +#endif + + if( NULL == (vpid_snapshot = find_vpid_snapshot(&proc)) ) { + ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND); + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Updated PID: %s : %d -> %d", + ORTE_NAME_PRINT(&vpid_snapshot->super.process_name), vpid_snapshot->process_pid, proc_pid)); + + /* JJH: Maybe we should save the old and the newly restarted pid? 
*/ + vpid_snapshot->process_pid = proc_pid; + vpid_snapshot->finished = true; + +#if OPAL_ENABLE_CRDEBUG == 1 + /* + * Once we have received all updates we should send them to the Global coord + */ + if( crdebug_enabled ) { + all_done = true; + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; + if( !vpid_snapshot->finished ) { + all_done = false; + break; + } + } + if( all_done ) { + /* If C/R Debugging then send hostlist */ + if( ORTE_SUCCESS != (ret = snapc_full_local_send_restart_proc_info() ) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + } + } +#endif + + /* Note: We should not update the ORTE structure since, if the CRS uses + * an intermediary restart mechanism (e.g., BLCR's cr_restart) that + * forks a child, then this process cannot call waitpid() on it. + */ } - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &proc_pid, &count, OPAL_PID))) { - ORTE_ERROR_LOG(ret); - goto cleanup; + else if( ORTE_SNAPC_LOCAL_FINISH_CMD == command ) { + /* + * Unpack the data + * - cr_state + */ + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &cr_state, &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + if( NULL == (vpid_snapshot = find_vpid_snapshot(sender)) ) { + opal_output(0, "Local) Failed to find process %s", + ORTE_NAME_PRINT(sender)); + ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND); + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Process %s Finished Recovery (%d)", + ORTE_NAME_PRINT(&vpid_snapshot->super.process_name), + cr_state)); + + vpid_snapshot->super.state = ORTE_SNAPC_CKPT_STATE_RECOVERED; + + /* + * Check if we are done + */ + is_done = true; + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item !=
opal_list_get_end(&(local_global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; + + if( ORTE_SNAPC_CKPT_STATE_RECOVERED != vpid_snapshot->super.state ) { + is_done = false; + break; + } + } + + if( is_done ) { + /* + * Tell the Global Coordinator that all of our processes are finished + */ + OPAL_OUTPUT_VERBOSE((15, mca_snapc_full_component.super.output_handle, + "Local) Job Ckpt finished - Confirmed! Tell the Global Coord\n")); + + if( ORTE_SUCCESS != (ret = snapc_full_local_update_coord(ORTE_SNAPC_CKPT_STATE_RECOVERED, true) ) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + /* + * If we are not finished sync'ing then delay cleanup + */ + if( !sstore_local_sync_finished ) { + sstore_local_procs_finished = true; + OPAL_OUTPUT_VERBOSE((15, mca_snapc_full_component.super.output_handle, + "Local) Job Ckpt finished - Confirmed! Not finished Syncing...\n")); + } else { + /* + * Cleanup + */ + if( ORTE_SUCCESS != (ret = orte_snapc_full_local_reset_coord()) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + } + } } - if( NULL == (vpid_snapshot = find_vpid_snapshot(&proc)) ) { - ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND); - goto cleanup; - } - - OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, - "Local) Updated PID: %s : %d -> %d", - ORTE_NAME_PRINT(&vpid_snapshot->super.process_name), vpid_snapshot->process_pid, proc_pid)); - - /* JJH: Maybe we should save the old and the newly restarted pid? 
*/ - vpid_snapshot->process_pid = proc_pid; - cleanup: /* release the message event */ OBJ_RELEASE(mev); return; } +#if OPAL_ENABLE_CRDEBUG == 1 +static int snapc_full_local_send_restart_proc_info(void) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_snapc_full_app_snapshot_t *vpid_snapshot = NULL; + opal_list_item_t* item = NULL; + opal_buffer_t buffer; + orte_snapc_full_cmd_flag_t command = ORTE_SNAPC_FULL_RESTART_PROC_INFO; + size_t num_vpids = 0; + + /* + * Global Coordinator: Operate locally + */ + if( ORTE_SNAPC_GLOBAL_COORD_TYPE == (orte_snapc_coord_type & ORTE_SNAPC_GLOBAL_COORD_TYPE)) { + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; + global_coord_restart_proc_info(vpid_snapshot->process_pid, orte_process_info.nodename); + } + /* stdout may be buffered by the C library so it needs to be flushed so + * that the debugger can read the process info. 
+ */ + fflush(stdout); + return ORTE_SUCCESS; + } + + /* + * Local Coordinator: Send Global Coordinator the information + * [ hostname, num_pids, {pids} ] + */ + num_vpids = opal_list_get_size(&(local_global_snapshot.local_snapshots)); + if( num_vpids <= 0 ) { + return exit_status; + } + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SNAPC_FULL_CMD))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(orte_process_info.nodename), 1, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &num_vpids, 1, OPAL_SIZE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(vpid_snapshot->process_pid), 1, OPAL_PID))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SNAPC_FULL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} +#endif + static void snapc_full_process_hnp_request_cmd(int fd, short event, void *cbdata) { int ret; @@ -575,6 +829,12 @@ static void snapc_full_process_hnp_request_cmd(int fd, short event, void *cbdata snapc_full_local_process_job_update_cmd(sender, mev->buffer, false); break; + case ORTE_SNAPC_FULL_RESTART_PROC_INFO: + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Command: Update hostname/pid associations")); + /* Nothing to do */ + break; + default: ORTE_ERROR_LOG(ORTE_ERR_VALUE_OUT_OF_BOUNDS); } @@ -592,21 
+852,24 @@ static void snapc_full_local_process_job_update_cmd(orte_process_name_t* sender, int ret, exit_status = ORTE_SUCCESS; orte_jobid_t jobid; int job_ckpt_state; - char *job_ckpt_ref = NULL; - char *job_ckpt_loc = NULL; orte_std_cntr_t count; opal_crs_base_ckpt_options_t *options = NULL; + bool loc_migrating = false; + size_t loc_num_procs = 0; + orte_process_name_t proc_name; + size_t i; + orte_snapc_full_app_snapshot_t *vpid_snapshot = NULL; + opal_list_item_t* item = NULL; + orte_sstore_base_handle_t ss_handle; /* * Unpack the data (quick) * - jobid * - ckpt_state + * - sstore_handle * Unpack the data (long) * - jobid * - ckpt_state - * - ckpt_reference - * - ckpt_location - * - ckpt_seq_number * - ckpt_options */ count = 1; @@ -624,22 +887,7 @@ static void snapc_full_local_process_job_update_cmd(orte_process_name_t* sender, } if( !quick ) { - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &job_ckpt_ref, &count, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &job_ckpt_loc, &count, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - count = 1; - if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &orte_snapc_base_snapshot_seq_number, &count, OPAL_SIZE))) { + if (ORTE_SUCCESS != (ret = orte_sstore.unpack_handle(sender, buffer, &ss_handle)) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; @@ -654,14 +902,56 @@ static void snapc_full_local_process_job_update_cmd(orte_process_name_t* sender, /* In this case we want to use the current_local_options that are cached * so that we do not have to send them every time. 
*/ - opal_crs_base_copy_options(options, current_local_options); + opal_crs_base_copy_options(options, local_global_snapshot.options); + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(loc_migrating), &count, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( loc_migrating ) { + currently_migrating = true; + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &loc_num_procs, &count, OPAL_SIZE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + for( i = 0; i < loc_num_procs; ++i ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &proc_name, &count, ORTE_NAME))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * See if we are watching this process + */ + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; + + if( OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, + &(vpid_snapshot->super.process_name), + &proc_name) ) { + vpid_snapshot->migrating = true; + break; + } + } + } + } } if( ORTE_SUCCESS != (ret = local_coord_job_state_update(jobid, job_ckpt_state, - &job_ckpt_ref, - &job_ckpt_loc, - current_local_options)) ) { + ss_handle, + local_global_snapshot.options)) ) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; @@ -679,8 +969,7 @@ static void snapc_full_local_process_job_update_cmd(orte_process_name_t* sender, int local_coord_job_state_update(orte_jobid_t jobid, int job_ckpt_state, - char **job_ckpt_ref, - char **job_ckpt_loc, + orte_sstore_base_handle_t ss_handle, opal_crs_base_ckpt_options_t *options) { int ret, exit_status = ORTE_SUCCESS; @@ -688,16 +977,8 @@ int local_coord_job_state_update(orte_jobid_t jobid, opal_list_item_t* item = NULL; char * state_str = NULL; - if( NULL != *job_ckpt_ref ) { - if( NULL != 
global_ckpt_ref ) { - free(global_ckpt_ref); - global_ckpt_ref = NULL; - } - global_ckpt_ref = strdup(*job_ckpt_ref); - } - /* Save Options */ - opal_crs_base_copy_options(options, current_local_options); + opal_crs_base_copy_options(options, local_global_snapshot.options); OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, "Local) Job %s: Changed to state to:\n", @@ -709,17 +990,6 @@ int local_coord_job_state_update(orte_jobid_t jobid, free(state_str); state_str = NULL; - if( NULL != *job_ckpt_ref ) { - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Local) Snapshot Ref: (%s)\n", - *job_ckpt_ref)); - } - if( NULL != *job_ckpt_loc ) { - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Local) Remote Location: (%s)\n", - *job_ckpt_loc)); - } - /* * Update the vpid structure if we need to. * Really only need to if we don't have valid information (PID) @@ -737,60 +1007,22 @@ int local_coord_job_state_update(orte_jobid_t jobid, * If we have been asked to checkpoint do so */ if( ORTE_SNAPC_CKPT_STATE_PENDING == job_ckpt_state ) { + /* + * Register with the SStore + */ + local_global_snapshot.ss_handle = ss_handle; + orte_sstore.register_handle(local_global_snapshot.ss_handle); /* * For each of the processes we are tasked with, start their checkpoints */ - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; vpid_snapshot->super.state = job_ckpt_state; - - opal_crs_base_copy_options(options, vpid_snapshot->options); - - /* - * Update it's local information - */ - if( NULL != vpid_snapshot->super.reference_name ) { - free(vpid_snapshot->super.reference_name); - vpid_snapshot->super.reference_name = NULL; - } - 
vpid_snapshot->super.reference_name = opal_crs_base_unique_snapshot_name(vpid_snapshot->super.process_name.vpid); - - /* global_directory/local_snapshot_vpid/... */ - if( NULL != vpid_snapshot->super.local_location ) { - free(vpid_snapshot->super.local_location); - vpid_snapshot->super.local_location = NULL; - } - - if( orte_snapc_base_store_in_place ) { - asprintf(&(vpid_snapshot->super.local_location), - "%s/%s", - *job_ckpt_loc, - vpid_snapshot->super.reference_name); - } - else { - /* Use the OPAL CRS base snapshot dir - * JJH: Do we want to do something more interesting? - */ - asprintf(&(vpid_snapshot->super.local_location), - "%s/%s", - opal_crs_base_snapshot_dir, - vpid_snapshot->super.reference_name); - } - - if( NULL != vpid_snapshot->super.remote_location ) { - free(vpid_snapshot->super.remote_location); - vpid_snapshot->super.remote_location = NULL; - } - - asprintf(&(vpid_snapshot->super.remote_location), - "%s/%s", - *job_ckpt_loc, - vpid_snapshot->super.reference_name); - + vpid_snapshot->finished = false; } /* @@ -802,6 +1034,24 @@ int local_coord_job_state_update(orte_jobid_t jobid, goto cleanup; } } + else if( ORTE_SNAPC_CKPT_STATE_MIGRATING == job_ckpt_state ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Migrating: Display a list of processes migrating")); + + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); + item = opal_list_get_next(item) ) { + vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; + /* + * If this process migrated away, then remove it from our list. 
+ */ + if( vpid_snapshot->migrating ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Migrating: %s", + ORTE_NAME_PRINT(&vpid_snapshot->super.process_name) )); + } + } + } /* * Release all checkpointed processes now that the checkpoint is complete * If the request was to checkpoint then terminate this command will tell @@ -820,41 +1070,10 @@ int local_coord_job_state_update(orte_jobid_t jobid, * Once we get the FINISHED state then the checkpoint is all done, and we * reset our state to NONE. */ - else if( ORTE_SNAPC_CKPT_STATE_FINISHED == job_ckpt_state ) { - OPAL_OUTPUT_VERBOSE((15, mca_snapc_full_component.super.output_handle, - "Local) Job Ckpt finished - Cleanup\n")); - - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); - item = opal_list_get_next(item) ) { - vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; - - /* Forgot to close the pipes to the application - * This can happen if we never received a FINISHED_LOCAL, but only - * a FINISHED - */ - if( vpid_snapshot->comm_pipe_w_fd > 0 ) { - if( ORTE_SUCCESS != (ret = local_coord_job_state_update_finished_local() ) ) { - ORTE_ERROR_LOG(ORTE_ERROR); - exit_status = ORTE_ERROR; - goto cleanup; - } - } - - vpid_snapshot->super.state = ORTE_SNAPC_CKPT_STATE_NONE; - opal_crs_base_clear_options(vpid_snapshot->options); - } - + else if( ORTE_SNAPC_CKPT_STATE_ESTABLISHED == job_ckpt_state ) { /* - * Clear globally cached options + * Wait to cleanup until all have reported */ - opal_crs_base_clear_options(current_local_options); - } - /* - * States not handled - */ - else if( ORTE_SNAPC_CKPT_STATE_FILE_XFER == job_ckpt_state ) { - ; } else { ; @@ -871,34 +1090,63 @@ int local_coord_job_state_update(orte_jobid_t jobid, static int local_coord_job_state_update_finished_local(void) { - int ret; + int ret, exit_status = ORTE_SUCCESS; orte_snapc_full_app_snapshot_t *vpid_snapshot = NULL; opal_list_item_t* item = NULL; 
OPAL_OUTPUT_VERBOSE((15, mca_snapc_full_component.super.output_handle, "Local) Job Ckpt finished tell all processes\n")); - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; - OPAL_OUTPUT_VERBOSE((15, mca_snapc_full_component.super.output_handle, - "Local) Tell process %s\n", - ORTE_NAME_PRINT(&vpid_snapshot->super.process_name))); - - if( ORTE_SUCCESS != (ret = snapc_full_local_end_ckpt_handshake(vpid_snapshot) ) ) { - opal_output(mca_snapc_full_component.super.output_handle, - "Local) Error: Unable to finish the handshake with peer %s. %d\n", - ORTE_NAME_PRINT(&vpid_snapshot->super.process_name), ret); + if( ORTE_SUCCESS != (ret = local_coord_job_state_update_finished_local_vpid(vpid_snapshot) ) ) { ORTE_ERROR_LOG(ORTE_ERROR); - return ORTE_ERROR; + exit_status = ORTE_ERROR; + goto cleanup; } + + if( vpid_snapshot->migrating ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Removing Migrated Process: %s", + ORTE_NAME_PRINT(&vpid_snapshot->super.process_name) )); + opal_list_remove_item(&(local_global_snapshot.local_snapshots), item); + } + } + cleanup: + return exit_status; +} + +static int local_coord_job_state_update_finished_local_vpid(orte_snapc_full_app_snapshot_t *vpid_snapshot) +{ + int ret; + + OPAL_OUTPUT_VERBOSE((15, mca_snapc_full_component.super.output_handle, + "Local) Tell process %s (Ckpt Finished) %s\n", + ORTE_NAME_PRINT(&vpid_snapshot->super.process_name), + (vpid_snapshot->migrating ? 
"- Migrating, Skip" : "") )); + + /* + * If this process is migrating, it has already been told + */ + if( vpid_snapshot->migrating ) { + return ORTE_SUCCESS; + } + + if( ORTE_SUCCESS != (ret = snapc_full_local_end_ckpt_handshake(vpid_snapshot) ) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "Local) Error: Unable to finish the handshake with peer %s. %d\n", + ORTE_NAME_PRINT(&vpid_snapshot->super.process_name), ret); + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; } return ORTE_SUCCESS; } + /************************ * Start the checkpoint ************************/ @@ -908,51 +1156,21 @@ static int snapc_full_local_start_checkpoint_all(int ckpt_state, int ret, exit_status = ORTE_SUCCESS; orte_snapc_full_app_snapshot_t *vpid_snapshot; opal_list_item_t* item = NULL; - char * actual_local_dir = NULL; - char *tmp_pid = NULL; size_t num_stopped = 0; int waitpid_status = 0; /* - * Cannot let opal-checkpoint be passed the --term flag - * since the HNP needs to talk to the app to get - * information for FileM. HNP will issue the termination. - * JJH: Eventually release the contraint that the app needs to - * be alive for FileM to properly work. - * However if we are storing in place, then we don't use - * the FileM framework and can terminate the application - * from this command. + * Pass 1: make sure all vpids are setup correctly + * This is a sanity check. Most of the time it will not be necessary. 
*/ - if ( !orte_snapc_base_store_in_place ) { - options->term = false; - } + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Local) start() Pass 1: Sanity check")); - /* - * Pass 1: Setup snapshot directory - */ - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; - /* - * Set up the snapshot directory per suggestion from - * the Global Snapshot Coordinator - * If we can't create the suggested local directory, do what we can and update - * local directory reference in the GPR - */ - if( ORTE_SUCCESS != (ret = snapc_full_local_setup_snapshot_dir(vpid_snapshot->super.reference_name, - vpid_snapshot->super.local_location, - &actual_local_dir) ) ) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, - "Local) Using directory (%s)\n", - vpid_snapshot->super.local_location)); - /* Dummy check */ if( vpid_snapshot->process_pid == 0 ) { ret = snapc_full_local_get_vpids(); @@ -964,30 +1182,24 @@ static int snapc_full_local_start_checkpoint_all(int ckpt_state, exit_status = ORTE_ERROR; goto cleanup; } + break; } } /* * Pass 2: Start process of opening communication channels */ - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Local) start() Pass 2: Signal Procs")); + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; /* * Create named pipe references for this process */ - 
if( NULL == vpid_snapshot->comm_pipe_w || - NULL == vpid_snapshot->comm_pipe_r ) { - if( NULL != tmp_pid ) { - free(tmp_pid); - tmp_pid = NULL; - } - asprintf(&tmp_pid, "%d", vpid_snapshot->process_pid); - asprintf(&(vpid_snapshot->comm_pipe_w), "%s/%s.%s", opal_cr_pipe_dir, OPAL_CR_NAMED_PROG_R, tmp_pid); - asprintf(&(vpid_snapshot->comm_pipe_r), "%s/%s.%s", opal_cr_pipe_dir, OPAL_CR_NAMED_PROG_W, tmp_pid); - } + local_define_pipe_names(vpid_snapshot); OPAL_OUTPUT_VERBOSE((20, mca_snapc_full_component.super.output_handle, "Local) Signal process (%d) with signal %d\n", @@ -1011,8 +1223,10 @@ static int snapc_full_local_start_checkpoint_all(int ckpt_state, /* * Pass 3: Wait for channels to open up */ - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Local) start() Pass 3: Open pipes")); + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; @@ -1024,23 +1238,17 @@ static int snapc_full_local_start_checkpoint_all(int ckpt_state, exit_status = ORTE_ERROR; goto cleanup; } + vpid_snapshot->super.state = ORTE_SNAPC_CKPT_STATE_RUNNING; } /* - * Progress Update to Global Coordinator + * Pass 4: Start Handshake, send option argument set and sstore handle */ - if( ORTE_SUCCESS != (ret = snapc_full_local_update_coord(ORTE_SNAPC_CKPT_STATE_RUNNING, true) ) ) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - /* - * Pass 4: Start Handshake, send term argument - */ - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Local) start() Pass 4: Start handshake")); + for(item = 
opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; @@ -1055,10 +1263,12 @@ static int snapc_full_local_start_checkpoint_all(int ckpt_state, } /* - * Pass 5: Start Handshake, send snapshot reference/location arguments + * Pass 5: Start checkpoint */ - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Local) start() Pass 5: Start checkpoints")); + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; @@ -1072,16 +1282,27 @@ static int snapc_full_local_start_checkpoint_all(int ckpt_state, } } + /* + * Progress Update to Global Coordinator + */ + OPAL_OUTPUT_VERBOSE((5, mca_snapc_full_component.super.output_handle, + "Local) start() Pass 6: Tell Global Coord that we are running now")); + if( ORTE_SUCCESS != (ret = snapc_full_local_update_coord(ORTE_SNAPC_CKPT_STATE_RUNNING, true) ) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + /* * If stopping then wait for all processes to stop */ if( options->stop ) { - while( num_stopped < opal_list_get_size(&snapc_local_vpids) ) { + while( num_stopped < opal_list_get_size(&(local_global_snapshot.local_snapshots)) ) { opal_progress(); sleep(1); - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; @@ -1105,10 +1326,11 @@ static int 
snapc_full_local_start_checkpoint_all(int ckpt_state, } skip_wait: - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); + for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots)); + item != opal_list_get_end(&(local_global_snapshot.local_snapshots)); item = opal_list_get_next(item) ) { vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; + vpid_snapshot->super.state = ORTE_SNAPC_CKPT_STATE_STOPPED; } @@ -1116,6 +1338,11 @@ static int snapc_full_local_start_checkpoint_all(int ckpt_state, "Local) All Children have now been stopped [total = %d]", (int)num_stopped )); + /* + * Finish the local snapshot + */ + orte_sstore.sync(local_global_snapshot.ss_handle); + /* * Progress Update to Global Coordinator */ @@ -1127,11 +1354,6 @@ static int snapc_full_local_start_checkpoint_all(int ckpt_state, } cleanup: - if( NULL != tmp_pid ) { - free(tmp_pid); - tmp_pid = NULL; - } - if( ORTE_SUCCESS != exit_status ) { ckpt_state = ORTE_SNAPC_CKPT_STATE_ERROR; } @@ -1139,16 +1361,40 @@ static int snapc_full_local_start_checkpoint_all(int ckpt_state, return exit_status; } +static int local_define_pipe_names(orte_snapc_full_app_snapshot_t *vpid_snapshot) +{ + if( NULL != vpid_snapshot->comm_pipe_r ) { + free(vpid_snapshot->comm_pipe_r); + vpid_snapshot->comm_pipe_r = NULL; + } + + if( NULL != vpid_snapshot->comm_pipe_w ) { + free(vpid_snapshot->comm_pipe_w); + vpid_snapshot->comm_pipe_w = NULL; + } + + asprintf(&(vpid_snapshot->comm_pipe_w), + "%s/%s.%d_%d", + opal_cr_pipe_dir, OPAL_CR_NAMED_PROG_R, + vpid_snapshot->process_pid, + vpid_snapshot->unique_pipe_id); + + asprintf(&(vpid_snapshot->comm_pipe_r), + "%s/%s.%d_%d", + opal_cr_pipe_dir, OPAL_CR_NAMED_PROG_W, + vpid_snapshot->process_pid, + vpid_snapshot->unique_pipe_id); + + (vpid_snapshot->unique_pipe_id)++; + + return ORTE_SUCCESS; +} + static int snapc_full_local_update_coord(int state, bool quick) { int ret, exit_status = ORTE_SUCCESS; opal_buffer_t buffer; 
orte_snapc_full_cmd_flag_t command; - opal_list_item_t* item = NULL; - orte_snapc_full_app_snapshot_t *vpid_snapshot = NULL; - char *crs_agent = NULL; - size_t sz = 0; - char *loc_location = NULL; /* * Local Coordinator: Send Global Coordinator state information @@ -1173,49 +1419,15 @@ static int snapc_full_local_update_coord(int state, bool quick) } /* Optionally send only an abbreviated message to improve scalability */ + /* JJH: Though there is currently no additional information sent in a long + * message versus a small message, keep this logic so that in the + * future it can be easily reused without substantially modifying + * the component. + */ if( quick ) { goto send_data; } - crs_agent = strdup(opal_crs_base_selected_component.base_version.mca_component_name); - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(crs_agent), 1, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - sz = opal_list_get_size(&snapc_local_vpids); - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &sz, 1, OPAL_SIZE))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - for(item = opal_list_get_first(&snapc_local_vpids); - item != opal_list_get_end(&snapc_local_vpids); - item = opal_list_get_next(item) ) { - vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item; - - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(vpid_snapshot->super.process_name), 1, ORTE_NAME))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(vpid_snapshot->super.reference_name), 1, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - - loc_location = opal_dirname(vpid_snapshot->super.local_location); - if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(loc_location), 1, OPAL_STRING))) { - ORTE_ERROR_LOG(ret); - exit_status = ret; - goto cleanup; - } - } - send_data: if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SNAPC_FULL, 
0))) { ORTE_ERROR_LOG(ret); @@ -1226,15 +1438,6 @@ static int snapc_full_local_update_coord(int state, bool quick) cleanup: OBJ_DESTRUCT(&buffer); - if( NULL != crs_agent ) { - free(crs_agent); - crs_agent = NULL; - } - if( NULL != loc_location ) { - free(loc_location); - loc_location = NULL; - } - return exit_status; } @@ -1354,9 +1557,16 @@ static int snapc_full_local_start_ckpt_handshake_opts(orte_snapc_full_app_snapsh /* * Start the handshake: + * - Send the migrating options [All, this proc] * - Send term argument * - Send stop argument */ + if( vpid_snapshot->migrating ) { + OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, + "Local) Tell app to MIGRATE. [%s (%d)]\n", + (vpid_snapshot->migrating ? "True" : "False"), + (int)(currently_migrating) )); + } if( options->term ) { OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle, "Local) Tell app to TERMINATE after completion of checkpoint. [%s]\n", @@ -1368,10 +1578,29 @@ static int snapc_full_local_start_ckpt_handshake_opts(orte_snapc_full_app_snapsh (options->stop ? 
"True" : "False") )); } + + opt_rep = (int)(currently_migrating); + if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "Local) Error: Unable to write currently_migrating (%d) to named pipe (%s), %d\n", + (int)(currently_migrating), vpid_snapshot->comm_pipe_w, ret); + exit_status = OPAL_ERROR; + goto cleanup; + } + + opt_rep = (int)(vpid_snapshot->migrating); + if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "Local) Error: Unable to write migrating (%d) to named pipe (%s), %d\n", + vpid_snapshot->migrating, vpid_snapshot->comm_pipe_w, ret); + exit_status = OPAL_ERROR; + goto cleanup; + } + opt_rep = (int)(options->term); if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &opt_rep, sizeof(int))) ) { opal_output(mca_snapc_full_component.super.output_handle, - "local) Error: Unable to write term (%d) to named pipe (%s), %d\n", + "Local) Error: Unable to write term (%d) to named pipe (%s), %d\n", options->term, vpid_snapshot->comm_pipe_w, ret); exit_status = OPAL_ERROR; goto cleanup; @@ -1380,12 +1609,63 @@ static int snapc_full_local_start_ckpt_handshake_opts(orte_snapc_full_app_snapsh opt_rep = (int)(options->stop); if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &opt_rep, sizeof(int))) ) { opal_output(mca_snapc_full_component.super.output_handle, - "local) Error: Unable to write stop (%d) to named pipe (%s), %d\n", + "Local) Error: Unable to write stop (%d) to named pipe (%s), %d\n", options->stop, vpid_snapshot->comm_pipe_w, ret); exit_status = OPAL_ERROR; goto cleanup; } + opt_rep = (int)(options->inc_prep_only); + if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "Local) Error: Unable to write inc_prep_only (%d) to named pipe (%s), %d\n", +
options->inc_prep_only, vpid_snapshot->comm_pipe_w, ret); + exit_status = OPAL_ERROR; + goto cleanup; + } + + opt_rep = (int)(options->inc_recover_only); + if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "Local) Error: Unable to write inc_recover_only (%d) to named pipe (%s), %d\n", + options->inc_recover_only, vpid_snapshot->comm_pipe_w, ret); + exit_status = OPAL_ERROR; + goto cleanup; + } + +#if OPAL_ENABLE_CRDEBUG == 1 + opt_rep = (int)(options->attach_debugger); + if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "Local) Error: Unable to write attach_debugger (%d) to named pipe (%s), %d\n", + options->attach_debugger, vpid_snapshot->comm_pipe_w, ret); + exit_status = OPAL_ERROR; + goto cleanup; + } + + opt_rep = (int)(options->detach_debugger); + if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &opt_rep, sizeof(int))) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "Local) Error: Unable to write detach_debugger (%d) to named pipe (%s), %d\n", + options->detach_debugger, vpid_snapshot->comm_pipe_w, ret); + exit_status = OPAL_ERROR; + goto cleanup; + } +#endif + + /* + * Send the SStore handle + */ + if( sizeof(orte_sstore_base_handle_t) != (ret = write(vpid_snapshot->comm_pipe_w_fd, + &(local_global_snapshot.ss_handle), sizeof(orte_sstore_base_handle_t) )) ) { + opal_output(mca_snapc_full_component.super.output_handle, + "Local) Error: Unable to write sstore handle (%d) to named pipe (%s).
%d\n", + (int)(local_global_snapshot.ss_handle), vpid_snapshot->comm_pipe_w, ret); + ORTE_ERROR_LOG(OPAL_ERROR); + exit_status = OPAL_ERROR; + goto cleanup; + } + cleanup: return exit_status; } @@ -1393,16 +1673,14 @@ static int snapc_full_local_start_ckpt_handshake_opts(orte_snapc_full_app_snapsh static int snapc_full_local_start_ckpt_handshake(orte_snapc_full_app_snapshot_t *vpid_snapshot) { int ret, exit_status = ORTE_SUCCESS; - char *local_dir = NULL; - int len, value; - ssize_t tmp_size = 0; + int value; /* * Wait for the appliation to respond */ if( sizeof(int) != (ret = read(vpid_snapshot->comm_pipe_r_fd, &value, sizeof(int))) ) { opal_output(mca_snapc_full_component.super.output_handle, - "local) Error: Unable to read length from named pipe (%s). %d\n", + "Local) Error: Unable to read length from named pipe (%s). %d\n", vpid_snapshot->comm_pipe_r, ret); exit_status = OPAL_ERROR; goto cleanup; @@ -1443,99 +1721,18 @@ static int snapc_full_local_start_ckpt_handshake(orte_snapc_full_app_snapshot_t opal_event_add(&(vpid_snapshot->comm_pipe_r_eh), NULL); /* - * Send: Snapshot Name + * Let the application know that it can proceed */ - len = strlen(vpid_snapshot->super.reference_name) + 1; - if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &len, sizeof(int))) ) { + if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &value, sizeof(int))) ) { opal_output(mca_snapc_full_component.super.output_handle, - "local) Error: Unable to write snapshot name len (%d) to named pipe (%s). %d\n", - len, vpid_snapshot->comm_pipe_w, ret); - ORTE_ERROR_LOG(OPAL_ERROR); - exit_status = OPAL_ERROR; - goto cleanup; - } - - tmp_size = sizeof(char) * len; - if( tmp_size != (ret = write(vpid_snapshot->comm_pipe_w_fd, (vpid_snapshot->super.reference_name), (sizeof(char) * len))) ) { - opal_output(mca_snapc_full_component.super.output_handle, - "local) Error: Unable to write snapshot name (%s) to named pipe (%s). 
%d\n",
-                    vpid_snapshot->super.reference_name, vpid_snapshot->comm_pipe_w, ret);
-        ORTE_ERROR_LOG(OPAL_ERROR);
-        exit_status = OPAL_ERROR;
-        goto cleanup;
-    }
-
-    /*
-     * Send: Snapshot Location
-     */
-    local_dir = strdup(vpid_snapshot->super.local_location);
-    local_dir = opal_dirname(local_dir);
-    len = strlen(local_dir) + 1;
-    if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &len, sizeof(int))) ) {
-        opal_output(mca_snapc_full_component.super.output_handle,
-                    "local) Error: Unable to write snapshot location len (%d) to named pipe (%s). %d\n",
-                    len, vpid_snapshot->comm_pipe_w, ret);
-        ORTE_ERROR_LOG(OPAL_ERROR);
-        exit_status = OPAL_ERROR;
-        goto cleanup;
-    }
-
-    tmp_size = sizeof(char) * len;
-    if( tmp_size != (ret = write(vpid_snapshot->comm_pipe_w_fd, (local_dir), (sizeof(char) * len))) ) {
-        opal_output(mca_snapc_full_component.super.output_handle,
-                    "local) Error: Unable to write snapshot location (%s) to named pipe (%s). %d\n",
-                    local_dir, vpid_snapshot->comm_pipe_w, ret);
-        ORTE_ERROR_LOG(OPAL_ERROR);
-        exit_status = OPAL_ERROR;
-        goto cleanup;
-    }
-
-    /*
-     * Send: Global Snapshot Ref
-     */
-    if( NULL == global_ckpt_ref ) {
-        ORTE_ERROR_LOG(ORTE_ERROR);
-        exit_status = ORTE_ERROR;
-        goto cleanup;
-    }
-    len = strlen(global_ckpt_ref) + 1;
-    if( sizeof(int) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &len, sizeof(int))) ) {
-        opal_output(mca_snapc_full_component.super.output_handle,
-                    "local) Error: Unable to write global snapshot ref len (%d) to named pipe (%s). %d\n",
-                    len, vpid_snapshot->comm_pipe_w, ret);
-        ORTE_ERROR_LOG(OPAL_ERROR);
-        exit_status = OPAL_ERROR;
-        goto cleanup;
-    }
-
-    tmp_size = sizeof(char) * len;
-    if( tmp_size != (ret = write(vpid_snapshot->comm_pipe_w_fd, (global_ckpt_ref), (sizeof(char) * len))) ) {
-        opal_output(mca_snapc_full_component.super.output_handle,
-                    "local) Error: Unable to write global snapshot ref (%s) to named pipe (%s). %d\n",
-                    global_ckpt_ref, vpid_snapshot->comm_pipe_w, ret);
-        ORTE_ERROR_LOG(OPAL_ERROR);
-        exit_status = OPAL_ERROR;
-        goto cleanup;
-    }
-
-    /*
-     * Send: Seq. Number
-     */
-    if( sizeof(size_t) != (ret = write(vpid_snapshot->comm_pipe_w_fd, &orte_snapc_base_snapshot_seq_number, sizeof(size_t))) ) {
-        opal_output(mca_snapc_full_component.super.output_handle,
-                    "local) Error: Unable to write global snapshot seq number (%d) to named pipe (%s). %d\n",
-                    (int)orte_snapc_base_snapshot_seq_number, vpid_snapshot->comm_pipe_w, ret);
+                    "Local) Error: Unable to write to named pipe (%s). %d\n",
+                    vpid_snapshot->comm_pipe_w, ret);
         ORTE_ERROR_LOG(OPAL_ERROR);
         exit_status = OPAL_ERROR;
         goto cleanup;
     }
 
  cleanup:
-    if( NULL != local_dir ) {
-        free(local_dir);
-        local_dir = NULL;
-    }
-
     return exit_status;
 }
@@ -1595,13 +1792,38 @@ static void snapc_full_local_comm_read_event(int fd, short flags, void *arg)
     /*
      * Get the final state of the checkpoint from the checkpointing process
      */
-    if( sizeof(int) != (ret = read(vpid_snapshot->comm_pipe_r_fd, &ckpt_state, sizeof(int))) ) {
-        opal_output(mca_snapc_full_component.super.output_handle,
-                    "local) Error: Unable to read state from named pipe (%s). %d\n",
-                    vpid_snapshot->comm_pipe_r, ret);
-        ORTE_ERROR_LOG(ORTE_ERROR);
-        exit_status = ORTE_ERROR;
-        goto cleanup;
+    if( !vpid_snapshot->migrating ) {
+        if( sizeof(int) != (ret = read(vpid_snapshot->comm_pipe_r_fd, &ckpt_state, sizeof(int))) ) {
+            opal_output(mca_snapc_full_component.super.output_handle,
+                        "Local) Error: Unable to read state from named pipe (%s). %d\n",
+                        vpid_snapshot->comm_pipe_r, ret);
+            ORTE_ERROR_LOG(ORTE_ERROR);
+            exit_status = ORTE_ERROR;
+            goto cleanup;
+        }
+
+        /*
+         * If only doing INC Prep phase, then jump out
+         */
+        if( local_global_snapshot.options->inc_prep_only &&
+            OPAL_CRS_RUNNING == ckpt_state ) {
+            /*
+             * If all local procs are done, then tell the Global coord
+             */
+            vpid_snapshot->super.state = ORTE_SNAPC_CKPT_STATE_INC_PREPED;
+            loc_min_state = snapc_full_get_min_state();
+            if( loc_min_state > current_job_ckpt_state &&
+                ORTE_SNAPC_CKPT_STATE_INC_PREPED == loc_min_state ) {
+                if( ORTE_SUCCESS != (ret = snapc_full_local_update_coord(ORTE_SNAPC_CKPT_STATE_INC_PREPED, false) ) ) {
+                    ORTE_ERROR_LOG(ret);
+                    exit_status = ret;
+                    goto cleanup;
+                }
+            }
+
+            /* Just return */
+            return;
+        }
     }
 
     /*
@@ -1614,6 +1836,26 @@ static void snapc_full_local_comm_read_event(int fd, short flags, void *arg)
         vpid_snapshot->super.state = ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL;
     }
 
+    /*
+     * Flush GrpComm modex info if migrating
+     */
+    if( currently_migrating && !flushed_modex ) {
+        OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle,
+                             "Local) Read Event: Flush the modex cached data\n"));
+        if (ORTE_SUCCESS != (ret = orte_grpcomm.purge_proc_attrs())) {
+            ORTE_ERROR_LOG(ret);
+            exit_status = ret;
+            goto cleanup;
+        }
+        orte_grpcomm.finalize();
+        if (ORTE_SUCCESS != (ret = orte_grpcomm.init())) {
+            ORTE_ERROR_LOG(ret);
+            exit_status = ret;
+            goto cleanup;
+        }
+        flushed_modex = true;
+    }
+
     /*
      * If error, then exit early
      */
@@ -1629,9 +1871,6 @@ static void snapc_full_local_comm_read_event(int fd, short flags, void *arg)
 
     /*
      * If all processes have finished locally, notify Global Coordinator
-     * if(FIN_LOCAL) {
-     *   -- wait for the FIN from Global Coord --
-     * }
      */
     loc_min_state = snapc_full_get_min_state();
     if( loc_min_state > current_job_ckpt_state &&
@@ -1644,21 +1883,46 @@ static void snapc_full_local_comm_read_event(int fd, short flags, void *arg)
         free(state_str);
         state_str = NULL;
 
+        /*
+         * Notify the Global Coordinator
+         */
         current_job_ckpt_state = loc_min_state;
-        if( ORTE_SUCCESS != (ret = snapc_full_local_update_coord(loc_min_state, false) ) ) {
-            ORTE_ERROR_LOG(ret);
-            exit_status = ret;
-            goto cleanup;
+        if( ORTE_SNAPC_GLOBAL_COORD_TYPE != (orte_snapc_coord_type & ORTE_SNAPC_GLOBAL_COORD_TYPE)) {
+            if( ORTE_SUCCESS != (ret = snapc_full_local_update_coord(loc_min_state, false) ) ) {
+                ORTE_ERROR_LOG(ret);
+                exit_status = ret;
+                goto cleanup;
+            }
         }
-    }
 
-    /*
-     * If file transfer required just set state
-     * Global Coordinator does not need to be notified again since it can see
-     * these variables and knows what to do.
-     */
-    if( !orte_snapc_base_store_in_place && !orte_snapc_full_skip_filem ) {
-        vpid_snapshot->super.state = ORTE_SNAPC_CKPT_STATE_FILE_XFER;
+        /*
+         * Sync the SStore
+         * If we stopped the process then we already did this
+         */
+        if( !local_global_snapshot.options->stop ) {
+            orte_sstore.sync(local_global_snapshot.ss_handle);
+            sstore_local_sync_finished = true;
+            /*
+             * If the processes finished before we finished sync'ing
+             * then we need to cleanup.
+             */
+            if( sstore_local_procs_finished ) {
+                if( ORTE_SUCCESS != (ret = orte_snapc_full_local_reset_coord()) ) {
+                    ORTE_ERROR_LOG(ret);
+                    exit_status = ret;
+                    goto cleanup;
+                }
+            }
+        }
+
+        /* If this process is also the global coord, then we have to update -after- we sync locally */
+        if( ORTE_SNAPC_GLOBAL_COORD_TYPE == (orte_snapc_coord_type & ORTE_SNAPC_GLOBAL_COORD_TYPE)) {
+            if( ORTE_SUCCESS != (ret = snapc_full_local_update_coord(loc_min_state, false) ) ) {
+                ORTE_ERROR_LOG(ret);
+                exit_status = ret;
+                goto cleanup;
+            }
+        }
     }
 
  cleanup:
@@ -1684,8 +1948,8 @@ static int snapc_full_get_min_state(void)
     char * state_str_a = NULL;
     char * state_str_b = NULL;
 
-    for(item = opal_list_get_first(&snapc_local_vpids);
-        item != opal_list_get_end(&snapc_local_vpids);
+    for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots));
+        item != opal_list_get_end(&(local_global_snapshot.local_snapshots));
         item = opal_list_get_next(item) ) {
         vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item;
 
@@ -1700,7 +1964,8 @@ static int snapc_full_get_min_state(void)
 
         orte_snapc_ckpt_state_str(&state_str_b, min_state);
         OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle,
-                             "Local) ... Checking [%d %s] vs [%d %s]",
+                             "Local) ... %s Checking [%d %s] vs [%d %s]",
+                             ORTE_NAME_PRINT(&vpid_snapshot->super.process_name),
                              (int)vpid_snapshot->super.state, state_str_a,
                              (int)min_state, state_str_b ));
         if( min_state > vpid_snapshot->super.state ) {
@@ -1729,33 +1994,6 @@ static int snapc_full_get_min_state(void)
     return min_state;
 }
 
-static int snapc_full_local_setup_snapshot_dir(char * snapshot_ref, char * sugg_dir, char **actual_dir)
-{
-    int ret, exit_status = ORTE_SUCCESS;
-    mode_t my_mode = S_IRWXU;
-
-    /* See if we can use the suggested directory */
-    if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(sugg_dir, my_mode) ) ) {
-        /* Can't use that directory, try the default directory from OPAL CRS */
-        *actual_dir = strdup(opal_crs_base_get_snapshot_directory(snapshot_ref));
-
-        if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(*actual_dir, my_mode) ) ) {
-            /* Can't use that either, so let's give up */
-            ORTE_ERROR_LOG(ret);
-            exit_status = ret;
-            goto cleanup;
-        }
-    }
-    else {
-        /* We are able to use that directory */
-        *actual_dir = strdup(sugg_dir);
-    }
-
- cleanup:
-    return exit_status;
-}
-
-
 static int snapc_full_local_get_vpids(void)
 {
     opal_list_item_t *item = NULL;
@@ -1767,9 +2005,9 @@ static int snapc_full_local_get_vpids(void)
      * If the list is populated, and has updated pid information then
      * there is nothing to update.
      */
-    list_len = opal_list_get_size(&snapc_local_vpids);
+    list_len = opal_list_get_size(&(local_global_snapshot.local_snapshots));
     if( list_len > 0 ) {
-        vpid_snapshot = (orte_snapc_full_app_snapshot_t*)opal_list_get_first(&snapc_local_vpids);
+        vpid_snapshot = (orte_snapc_full_app_snapshot_t*)opal_list_get_first(&(local_global_snapshot.local_snapshots));
         if( 0 < vpid_snapshot->process_pid ) {
             return ORTE_SUCCESS;
         }
@@ -1787,7 +2025,7 @@ static int snapc_full_local_get_vpids(void)
         if( 0 >= list_len ||
             NULL == (vpid_snapshot = find_vpid_snapshot(child->name)) ) {
             vpid_snapshot = OBJ_NEW(orte_snapc_full_app_snapshot_t);
-            opal_list_append(&snapc_local_vpids, &(vpid_snapshot->super.super));
+            opal_list_append(&(local_global_snapshot.local_snapshots), &(vpid_snapshot->super.super));
         }
 
         /* Only update if the PID is -not- already set */
@@ -1801,13 +2039,89 @@ static int snapc_full_local_get_vpids(void)
     return ORTE_SUCCESS;
 }
 
+static int snapc_full_local_refresh_vpids(void)
+{
+    opal_list_item_t *item = NULL, *v_item = NULL;
+    orte_snapc_full_app_snapshot_t *vpid_snapshot = NULL;
+    orte_odls_child_t *child = NULL;
+    bool found = false;
+
+    /*
+     * First make sure that all of the vpids in the list are still our
+     * children (they may have moved)
+     */
+    for(v_item = opal_list_get_first(&(local_global_snapshot.local_snapshots));
+        v_item != opal_list_get_end(&(local_global_snapshot.local_snapshots));
+        v_item = opal_list_get_next(v_item) ) {
+        vpid_snapshot = (orte_snapc_full_app_snapshot_t*)v_item;
+
+        found = false;
+        for (item = opal_list_get_first(&orte_local_children);
+             item != opal_list_get_end(&orte_local_children);
+             item = opal_list_get_next(item)) {
+            child = (orte_odls_child_t*)item;
+
+            if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL,
+                                                           child->name,
+                                                           &(vpid_snapshot->super.process_name) )) {
+                found = true;
+                break;
+            }
+        }
+        if( !found ) {
+            OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle,
+                                 "Local) Refresh List: Remove Process %s (%5d) from Daemon %s",
+                                 ORTE_NAME_PRINT(&vpid_snapshot->super.process_name),
+                                 vpid_snapshot->process_pid,
+                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) ));
+            opal_list_remove_item(&(local_global_snapshot.local_snapshots), v_item);
+        }
+    }
+
+    /*
+     * Next pass to find new processes that we are not already tracking
+     * (processes migrated to us).
+     */
+    for (item = opal_list_get_first(&orte_local_children);
+         item != opal_list_get_end(&orte_local_children);
+         item = opal_list_get_next(item)) {
+        child = (orte_odls_child_t*)item;
+
+        /* if the list is empty or this child is not in the list then add it */
+        if( NULL == (vpid_snapshot = find_vpid_snapshot(child->name)) ) {
+            vpid_snapshot = OBJ_NEW(orte_snapc_full_app_snapshot_t);
+
+            vpid_snapshot->process_pid = child->pid;
+            vpid_snapshot->super.process_name.jobid = child->name->jobid;
+            vpid_snapshot->super.process_name.vpid = child->name->vpid;
+            /*vpid_snapshot->migrating = true;*/
+
+            opal_list_append(&(local_global_snapshot.local_snapshots), &(vpid_snapshot->super.super));
+
+            OPAL_OUTPUT_VERBOSE((10, mca_snapc_full_component.super.output_handle,
+                                 "Local) Refresh List: Add Process %s (%5d) to Daemon %s",
+                                 ORTE_NAME_PRINT(&vpid_snapshot->super.process_name),
+                                 vpid_snapshot->process_pid,
+                                 ORTE_NAME_PRINT(ORTE_PROC_MY_NAME) ));
+        }
+        /* Only update if the PID is -not- already set */
+        else if( 0 >= vpid_snapshot->process_pid ) {
+            vpid_snapshot->process_pid = child->pid;
+            vpid_snapshot->super.process_name.jobid = child->name->jobid;
+            vpid_snapshot->super.process_name.vpid = child->name->vpid;
+        }
+    }
+
+    return ORTE_SUCCESS;
+}
+
 static orte_snapc_full_app_snapshot_t *find_vpid_snapshot(orte_process_name_t *name )
 {
     opal_list_item_t* item = NULL;
     orte_snapc_full_app_snapshot_t *vpid_snapshot = NULL;
 
-    for(item = opal_list_get_first(&snapc_local_vpids);
-        item != opal_list_get_end(&snapc_local_vpids);
+    for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots));
+        item != opal_list_get_end(&(local_global_snapshot.local_snapshots));
         item = opal_list_get_next(item) ) {
         vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item;
 
@@ -1820,3 +2134,45 @@ static orte_snapc_full_app_snapshot_t *find_vpid_snapshot(orte_process_name_t *n
     return NULL;
 }
 
+static int orte_snapc_full_local_reset_coord(void)
+{
+    int ret, exit_status = ORTE_SUCCESS;
+    orte_snapc_full_app_snapshot_t *vpid_snapshot = NULL;
+    opal_list_item_t* item = NULL;
+
+    OPAL_OUTPUT_VERBOSE((15, mca_snapc_full_component.super.output_handle,
+                         "Local) Job Ckpt finished - Cleanup\n"));
+
+    for(item = opal_list_get_first(&(local_global_snapshot.local_snapshots));
+        item != opal_list_get_end(&(local_global_snapshot.local_snapshots));
+        item = opal_list_get_next(item) ) {
+        vpid_snapshot = (orte_snapc_full_app_snapshot_t*)item;
+
+        /* If we forgot to close the pipes to the application, then do so
+         * now. It is rare that this would have happened, so more of a
+         * sanity check as part of cleanup.
+         */
+        if( vpid_snapshot->comm_pipe_w_fd > 0 ) {
+            if( ORTE_SUCCESS != (ret = local_coord_job_state_update_finished_local_vpid(vpid_snapshot) ) ) {
+                ORTE_ERROR_LOG(ORTE_ERROR);
+                goto cleanup;
+            }
+        }
+
+        vpid_snapshot->super.state = ORTE_SNAPC_CKPT_STATE_NONE;
+    }
+
+    /*
+     * Clear globally cached options
+     */
+    opal_crs_base_clear_options(local_global_snapshot.options);
+
+    currently_migrating = false;
+    flushed_modex = false;
+
+    sstore_local_sync_finished = false;
+    sstore_local_procs_finished = false;
+
+ cleanup:
+    return exit_status;
+}
diff --git a/orte/mca/snapc/full/snapc_full_module.c b/orte/mca/snapc/full/snapc_full_module.c
index b6c9a15a13..973c28bbbe 100644
--- a/orte/mca/snapc/full/snapc_full_module.c
+++ b/orte/mca/snapc/full/snapc_full_module.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2009 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  * All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  * All rights reserved.
@@ -46,7 +46,8 @@ static orte_snapc_base_module_t loc_module = {
     orte_snapc_full_release_job,
     orte_snapc_full_ft_event,
     orte_snapc_full_start_ckpt,
-    orte_snapc_full_end_ckpt
+    orte_snapc_full_end_ckpt,
+    orte_snapc_full_request_op
 };
 
 /*
@@ -84,12 +85,6 @@ void orte_snapc_full_orted_construct(orte_snapc_full_orted_snapshot_t *snapshot)
     snapshot->process_name.vpid = 0;
 
     snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE;
-
-    snapshot->opal_crs = NULL;
-
-    snapshot->options = OBJ_NEW(opal_crs_base_ckpt_options_t);
-
-    snapshot->filem_request = NULL;
 }
 
 void orte_snapc_full_orted_destruct( orte_snapc_full_orted_snapshot_t *snapshot) {
@@ -97,21 +92,6 @@ void orte_snapc_full_orted_destruct( orte_snapc_full_orted_snapshot_t *snapshot)
     snapshot->process_name.vpid = 0;
 
     snapshot->state = ORTE_SNAPC_CKPT_STATE_NONE;
-
-    if( NULL != snapshot->opal_crs ) {
-        free( snapshot->opal_crs );
-        snapshot->opal_crs = NULL;
-    }
-
-    if( NULL != snapshot->options ) {
-        OBJ_RELEASE(snapshot->options);
-        snapshot->options = NULL;
-    }
-
-    if( NULL != snapshot->filem_request ) {
-        OBJ_RELEASE(snapshot->filem_request);
-        snapshot->filem_request = NULL;
-    }
 }
 
 void orte_snapc_full_app_construct(orte_snapc_full_app_snapshot_t *app_snapshot) {
@@ -122,10 +102,13 @@ void orte_snapc_full_app_construct(orte_snapc_full_app_snapshot_t *app_snapshot)
     app_snapshot->comm_pipe_w_fd = -1;
 
     app_snapshot->is_eh_active = false;
+    app_snapshot->unique_pipe_id = 0;
 
     app_snapshot->process_pid = 0;
 
-    app_snapshot->options = OBJ_NEW(opal_crs_base_ckpt_options_t);
+    app_snapshot->migrating = false;
+
+    app_snapshot->finished = false;
 }
 
 void orte_snapc_full_app_destruct( orte_snapc_full_app_snapshot_t *app_snapshot) {
@@ -143,13 +126,13 @@ void orte_snapc_full_app_destruct( orte_snapc_full_app_snapshot_t *app_snapshot)
     app_snapshot->comm_pipe_w_fd = -1;
 
     app_snapshot->is_eh_active = false;
+    app_snapshot->unique_pipe_id = 0;
 
     app_snapshot->process_pid = 0;
 
-    if( NULL != app_snapshot->options ) {
-        OBJ_RELEASE(app_snapshot->options);
-        app_snapshot->options = NULL;
-    }
+    app_snapshot->migrating = false;
+
+    app_snapshot->finished = false;
 }
 
 /*
@@ -325,7 +308,7 @@ int orte_snapc_full_start_ckpt(orte_snapc_base_quiesce_t *datum)
             ; /* Do nothing */
             break;
         case ORTE_SNAPC_APP_COORD_TYPE:
-            return app_coord_start_ckpt(datum);
+            ; /* Do nothing. Use app_coord_request_op() instead */
             break;
         default:
             break;
@@ -345,7 +328,27 @@ int orte_snapc_full_end_ckpt(orte_snapc_base_quiesce_t *datum)
             ; /* Do nothing */
             break;
         case ORTE_SNAPC_APP_COORD_TYPE:
-            return app_coord_end_ckpt(datum);
+            ; /* Do nothing. Use app_coord_request_op() instead */
+            break;
+        default:
+            break;
+    }
+
+    return ORTE_SUCCESS;
+}
+
+int orte_snapc_full_request_op(orte_snapc_base_request_op_t *datum)
+{
+    switch(orte_snapc_coord_type)
+    {
+        case ORTE_SNAPC_GLOBAL_COORD_TYPE:
+            ; /* Do nothing */
+            break;
+        case ORTE_SNAPC_LOCAL_COORD_TYPE:
+            ; /* Do nothing */
+            break;
+        case ORTE_SNAPC_APP_COORD_TYPE:
+            return app_coord_request_op(datum);
             break;
         default:
             break;
diff --git a/orte/mca/snapc/orte_snapc.7in b/orte/mca/snapc/orte_snapc.7in
index 1b3560c046..363f9e97b5 100644
--- a/orte/mca/snapc/orte_snapc.7in
+++ b/orte/mca/snapc/orte_snapc.7in
@@ -1,5 +1,5 @@
 .\"
-.\" Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
+.\" Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 .\" University Research and Technology
 .\" Corporation. All rights reserved.
 .\" Copyright (c) 2008-2009 Sun Microsystems, Inc. All rights reserved.
@@ -60,11 +60,7 @@ The following MCA parameters apply to all components:
 snapc_base_verbose
 Set the verbosity level for all components. Default is 0, or silent except on error.
 .
-.TP
-snapc_base_global_snapshot_dir
-The directory to store the checkpoint snapshots. Default is \fB/tmp\fP.
-.
-.\" Self Component
+.\" full Component
 .\" ******************
 .SS full SnapC Component
 .PP
diff --git a/orte/mca/snapc/snapc.h b/orte/mca/snapc/snapc.h
index 41bfe0bccd..f53884d6c7 100644
--- a/orte/mca/snapc/snapc.h
+++ b/orte/mca/snapc/snapc.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
+ * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
  * University Research and Technology
  * Corporation. All rights reserved.
  * Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -82,8 +82,11 @@
 #include "opal/mca/crs/base/base.h"
 
 #include "opal/class/opal_object.h"
+#include "opal/class/opal_pointer_array.h"
 #include "opal/util/output.h"
 
+#include "orte/mca/sstore/sstore.h"
+
 BEGIN_C_DECLS
 
 /**
@@ -100,17 +103,21 @@ BEGIN_C_DECLS
 #define ORTE_SNAPC_CKPT_STATE_PENDING 3
 /* Running the checkpoint */
 #define ORTE_SNAPC_CKPT_STATE_RUNNING 4
+/* INC Prep Finished */
+#define ORTE_SNAPC_CKPT_STATE_INC_PREPED 5
 /* All Processes have been stopped */
-#define ORTE_SNAPC_CKPT_STATE_STOPPED 5
+#define ORTE_SNAPC_CKPT_STATE_STOPPED 6
 /* Finished the checkpoint locally */
-#define ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL 6
-/* File Transfer in progress */
-#define ORTE_SNAPC_CKPT_STATE_FILE_XFER 7
-/* Finished the checkpoint */
-#define ORTE_SNAPC_CKPT_STATE_FINISHED 8
+#define ORTE_SNAPC_CKPT_STATE_FINISHED_LOCAL 7
+/* Migrating */
+#define ORTE_SNAPC_CKPT_STATE_MIGRATING 8
+/* Finished establishing the checkpoint */
+#define ORTE_SNAPC_CKPT_STATE_ESTABLISHED 9
+/* Processes continuing or have been recovered (finished post-INC) */
+#define ORTE_SNAPC_CKPT_STATE_RECOVERED 10
 /* Unable to checkpoint this job */
-#define ORTE_SNAPC_CKPT_STATE_NO_CKPT 9
-#define ORTE_SNAPC_CKPT_MAX 10
+#define ORTE_SNAPC_CKPT_STATE_NO_CKPT 11
+#define ORTE_SNAPC_CKPT_MAX 12
 
 /**
  * Definition of a orte local snapshot.
@@ -127,18 +134,8 @@ struct orte_snapc_base_local_snapshot_1_0_0_t {
 
     /** State of the checkpoint */
     int state;
-    /** Unique name of the local snapshot */
-    char * reference_name;
-
-    /** Local location of the local snapshot Absolute path */
-    char * local_location;
-
-    /** Remote location of the local snapshot Absolute path */
-    char * remote_location;
-
-    /** CRS agent */
-    char * opal_crs;
-
+    /** Stable Storage Handle (must equal the global version) */
+    orte_sstore_base_handle_t ss_handle;
 };
 typedef struct orte_snapc_base_local_snapshot_1_0_0_t orte_snapc_base_local_snapshot_1_0_0_t;
 typedef struct orte_snapc_base_local_snapshot_1_0_0_t orte_snapc_base_local_snapshot_t;
@@ -156,21 +153,12 @@ struct orte_snapc_base_global_snapshot_1_0_0_t {
 
     /** A list of orte_snapc_base_snapshot_t's */
     opal_list_t local_snapshots;
-
-    /** Unique name of the global snapshot */
-    char * reference_name;
-
-    /** Location of the global snapshot Absolute path */
-    char * local_location;
-
-    /** Sequence Number */
-    int seq_num;
 
-    /** Start Timestamp */
-    char * start_time;
+    /** Checkpoint Options */
+    opal_crs_base_ckpt_options_t *options;
 
-    /** End Timestamp */
-    char * end_time;
+    /** Stable Storage Handle */
+    orte_sstore_base_handle_t ss_handle;
 };
 typedef struct orte_snapc_base_global_snapshot_1_0_0_t orte_snapc_base_global_snapshot_1_0_0_t;
 typedef struct orte_snapc_base_global_snapshot_1_0_0_t orte_snapc_base_global_snapshot_t;
@@ -190,6 +178,11 @@ struct orte_snapc_base_quiesce_1_0_0_t {
 
     /** snapshot list */
     orte_snapc_base_global_snapshot_t *snapshot;
+    /** Stable Storage Handle */
+    orte_sstore_base_handle_t ss_handle;
+    /** Stable Storage Snapshot list */
+    orte_sstore_base_global_snapshot_info_t *ss_snapshot;
+
     /** Target Directory */
     char * target_dir;
     /** Command Line */
@@ -200,12 +193,74 @@ struct orte_snapc_base_quiesce_1_0_0_t {
     bool checkpointing;
     /** Restarting? */
     bool restarting;
+
+    /** Migrating? */
+    bool migrating;
+    /** List of migrating processes */
+    int num_migrating;
+    opal_pointer_array_t migrating_procs;
 };
 typedef struct orte_snapc_base_quiesce_1_0_0_t orte_snapc_base_quiesce_1_0_0_t;
 typedef struct orte_snapc_base_quiesce_1_0_0_t orte_snapc_base_quiesce_t;
 
 ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_snapc_base_quiesce_t);
 
+/**
+ * Application request for a global checkpoint related operation
+ */
+typedef enum {
+    ORTE_SNAPC_OP_NONE = 0,
+    ORTE_SNAPC_OP_INIT,
+    ORTE_SNAPC_OP_FIN,
+    ORTE_SNAPC_OP_FIN_ACK,
+    ORTE_SNAPC_OP_CHECKPOINT,
+    ORTE_SNAPC_OP_RESTART,
+    ORTE_SNAPC_OP_MIGRATE,
+    ORTE_SNAPC_OP_QUIESCE_START,
+    ORTE_SNAPC_OP_QUIESCE_CHECKPOINT,
+    ORTE_SNAPC_OP_QUIESCE_END
+} orte_snapc_base_request_op_event_t;
+
+struct orte_snapc_base_request_op_1_0_0_t {
+    /** Parent is an object type */
+    opal_object_t super;
+
+    /** Event to request */
+    orte_snapc_base_request_op_event_t event;
+
+    /** Is this request still active */
+    bool is_active;
+
+    /** Leader of the operation */
+    int leader;
+
+    /** Sequence Number */
+    int seq_num;
+
+    /** Global Handle */
+    char * global_handle;
+
+    /** Stable Storage Handle */
+    orte_sstore_base_handle_t ss_handle;
+
+    /** Migrating vpid list of participants */
+    int mig_num;
+    int *mig_vpids;
+
+    /** Migrating hostname preference list */
+    char (*mig_host_pref)[OPAL_MAX_PROCESSOR_NAME];
+
+    /** Migrating vpid preference list */
+    int *mig_vpid_pref;
+
+    /** Info key */
+    int *mig_off_node;
+};
+typedef struct orte_snapc_base_request_op_1_0_0_t orte_snapc_base_request_op_1_0_0_t;
+typedef struct orte_snapc_base_request_op_1_0_0_t orte_snapc_base_request_op_t;
+
+ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_snapc_base_request_op_t);
+
 /**
  * Module initialization function.
  * Returns ORTE_SUCCESS
@@ -267,6 +322,12 @@ typedef int (*orte_snapc_base_start_checkpoint_fn_t)
 typedef int (*orte_snapc_base_end_checkpoint_fn_t)
     (orte_snapc_base_quiesce_t *datum);
 
+/**
+ * Request a checkpoint related operation to take place
+ */
+typedef int (*orte_snapc_base_request_op_fn_t)
+    (orte_snapc_base_request_op_t *datum);
+
 /**
  * Structure for SNAPC components.
  */
@@ -303,6 +364,8 @@ struct orte_snapc_base_module_1_0_0_t {
     /** Handle internal request for checkpoint */
     orte_snapc_base_start_checkpoint_fn_t start_ckpt;
     orte_snapc_base_end_checkpoint_fn_t end_ckpt;
+    /** Handle a checkpoint related request */
+    orte_snapc_base_request_op_fn_t request_op;
 };
 typedef struct orte_snapc_base_module_1_0_0_t orte_snapc_base_module_1_0_0_t;
 typedef struct orte_snapc_base_module_1_0_0_t orte_snapc_base_module_t;
diff --git a/orte/mca/sstore/Makefile.am b/orte/mca/sstore/Makefile.am
new file mode 100644
index 0000000000..02b59966c5
--- /dev/null
+++ b/orte/mca/sstore/Makefile.am
@@ -0,0 +1,46 @@
+#
+# Copyright (c) 2010 The Trustees of Indiana University and Indiana
+# University Research and Technology
+# Corporation. All rights reserved.
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+
+include $(top_srcdir)/Makefile.man-page-rules
+
+# main library setup
+noinst_LTLIBRARIES = libmca_sstore.la
+libmca_sstore_la_SOURCES =
+
+# header setup
+nobase_orte_HEADERS =
+
+# local files
+headers = sstore.h
+libmca_sstore_la_SOURCES += $(headers)
+
+# Manual pages
+nodist_man_MANS = orte_sstore.7
+EXTRA_DIST = $(nodist_man_MANS:.7=.7in)
+
+# Ensure that the man pages are rebuilt if the opal_config.h file
+# changes; a "good enough" way to know if configure was run again (and
+# therefore the release date or version may have changed)
+$(nodist_man_MANS): $(top_builddir)/opal/include/opal_config.h
+
+# Conditionally install the header files
+if WANT_INSTALL_HEADERS
+nobase_orte_HEADERS += $(headers)
+ortedir = $(includedir)/openmpi/orte/mca/sstore
+else
+ortedir = $(includedir)
+endif
+
+include base/Makefile.am
+
+distclean-local:
+	rm -f base/static-components.h
+	rm -f $(nodist_man_MANS)
diff --git a/orte/mca/sstore/base/Makefile.am b/orte/mca/sstore/base/Makefile.am
new file mode 100644
index 0000000000..695ebcb699
--- /dev/null
+++ b/orte/mca/sstore/base/Makefile.am
@@ -0,0 +1,26 @@
+#
+# Copyright (c) 2010 The Trustees of Indiana University and Indiana
+# University Research and Technology
+# Corporation. All rights reserved.
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+
+headers += \
+	base/base.h
+
+libmca_sstore_la_SOURCES += \
+	base/sstore_base_open.c
+
+if !ORTE_DISABLE_FULL_SUPPORT
+
+dist_pkgdata_DATA = base/help-orte-sstore-base.txt
+
+libmca_sstore_la_SOURCES += \
+	base/sstore_base_close.c \
+	base/sstore_base_select.c \
+	base/sstore_base_fns.c
+endif
diff --git a/orte/mca/sstore/base/base.h b/orte/mca/sstore/base/base.h
new file mode 100644
index 0000000000..3198345dee
--- /dev/null
+++ b/orte/mca/sstore/base/base.h
@@ -0,0 +1,147 @@
+/*
+ * Copyright (c) 2010 The Trustees of Indiana University and Indiana
+ * University Research and Technology
+ * Corporation. All rights reserved.
+ * $COPYRIGHT$
+ *
+ * Additional copyrights may follow
+ *
+ * $HEADER$
+ */
+#ifndef ORTE_SSTORE_BASE_H
+#define ORTE_SSTORE_BASE_H
+
+#include "orte_config.h"
+
+#if !ORTE_DISABLE_FULL_SUPPORT
+#include "orte/mca/rml/rml.h"
+#endif
+
+#include "orte/mca/sstore/sstore.h"
+
+/*
+ * Global functions for MCA overall SStore
+ */
+
+BEGIN_C_DECLS
+
+/**
+ * Initialize the SSTORE MCA framework
+ *
+ * @retval ORTE_SUCCESS Upon success
+ * @retval ORTE_ERROR Upon failures
+ *
+ * This function is invoked during orte_init();
+ */
+ORTE_DECLSPEC int orte_sstore_base_open(void);
+
+#if !ORTE_DISABLE_FULL_SUPPORT
+    /**
+     * Select an available component.
+     *
+     * @retval ORTE_SUCCESS Upon Success
+     * @retval ORTE_NOT_FOUND If no component can be selected
+     * @retval ORTE_ERROR Upon other failure
+     *
+     */
+    ORTE_DECLSPEC int orte_sstore_base_select(void);
+
+    /**
+     * Finalize the SSTORE MCA framework
+     *
+     * @retval ORTE_SUCCESS Upon success
+     * @retval ORTE_ERROR Upon failures
+     *
+     * This function is invoked during orte_finalize();
+     */
+    ORTE_DECLSPEC int orte_sstore_base_close(void);
+
+    /**
+     * Object stuff
+     */
+    void orte_sstore_base_local_snapshot_info_construct(orte_sstore_base_local_snapshot_info_t *snapshot);
+    void orte_sstore_base_local_snapshot_info_destruct( orte_sstore_base_local_snapshot_info_t *snapshot);
+
+    void orte_sstore_base_global_snapshot_info_construct(orte_sstore_base_global_snapshot_info_t *snapshot);
+    void orte_sstore_base_global_snapshot_info_destruct( orte_sstore_base_global_snapshot_info_t *snapshot);
+
+    /**
+     * Globals
+     */
+    ORTE_DECLSPEC extern int orte_sstore_base_output;
+    ORTE_DECLSPEC extern opal_list_t orte_sstore_base_components_available;
+    ORTE_DECLSPEC extern orte_sstore_base_component_t orte_sstore_base_selected_component;
+    ORTE_DECLSPEC extern orte_sstore_base_module_t orte_sstore;
+
+    /*
+     * Context of this module
+     */
+#define ORTE_SSTORE_UNASSIGN_TYPE 0
+#define ORTE_SSTORE_GLOBAL_TYPE 1
+#define ORTE_SSTORE_LOCAL_TYPE 2
+#define ORTE_SSTORE_TOOL_TYPE 4
+#define ORTE_SSTORE_APP_TYPE 8
+    ORTE_DECLSPEC extern int orte_sstore_context;
+
+    /**
+     * Snapshot metadata
+     */
+#define SSTORE_METADATA_LOCAL_CRS_COMP_STR CRS_METADATA_COMP
+#define SSTORE_METADATA_LOCAL_PID_STR CRS_METADATA_PID
+#define SSTORE_METADATA_LOCAL_CONTEXT_STR CRS_METADATA_CONTEXT
+#define SSTORE_METADATA_LOCAL_MKDIR_STR CRS_METADATA_MKDIR
+#define SSTORE_METADATA_LOCAL_TOUCH_STR CRS_METADATA_TOUCH
+
+#define SSTORE_METADATA_LOCAL_COMPRESS_COMP_STR ("# OPAL Compress Component: ")
+#define SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX_STR ("# OPAL Compress Postfix: ")
+
+#define SSTORE_METADATA_LOCAL_SNAP_REF_FMT_STR ("# Local Snapshot Format Reference: ")
+#define SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR ("# Seq: ")
+#define SSTORE_METADATA_GLOBAL_AMCA_PARAM_STR ("# AMCA: ")
+
+#define SSTORE_METADATA_INTERNAL_DONE_SEQ_STR ("# Finished Seq: ")
+#define SSTORE_METADATA_INTERNAL_TIME_STR ("# Timestamp: ")
+#define SSTORE_METADATA_INTERNAL_PROCESS_STR ("# Process: ")
+
+#define SSTORE_METADATA_INTERNAL_MIG_SEQ_STR ("# Migrate Seq: ")
+#define SSTORE_METADATA_INTERNAL_DONE_MIG_SEQ_STR ("# Finished Migrate Seq: ")
+
+
+    /**
+     * Some utility functions
+     */
+    ORTE_DECLSPEC extern bool orte_sstore_base_is_checkpoint_available;
+    ORTE_DECLSPEC extern char * orte_sstore_base_local_metadata_filename;
+    ORTE_DECLSPEC extern char * orte_sstore_base_global_metadata_filename;
+    ORTE_DECLSPEC extern char * orte_sstore_base_local_snapshot_fmt;
+    ORTE_DECLSPEC extern char * orte_sstore_base_global_snapshot_dir;
+    ORTE_DECLSPEC extern char * orte_sstore_base_global_snapshot_ref;
+    ORTE_DECLSPEC extern char * orte_sstore_base_prelaunch_location;
+
+    ORTE_DECLSPEC int orte_sstore_base_get_global_snapshot_ref(char **name_str, pid_t pid);
+
+    ORTE_DECLSPEC int orte_sstore_base_convert_key_to_string(orte_sstore_base_key_t key, char **key_str);
+    ORTE_DECLSPEC int orte_sstore_base_convert_string_to_key(char *key_str, orte_sstore_base_key_t *key);
+
+    ORTE_DECLSPEC int orte_sstore_base_metadata_read_next_seq_num(FILE *file);
+    ORTE_DECLSPEC int orte_sstore_base_metadata_read_next_token(FILE *file, char **token, char **value);
+    ORTE_DECLSPEC int orte_sstore_base_metadata_seek_to_seq_num(FILE *file, int seq_num);
+
+    ORTE_DECLSPEC int orte_sstore_base_extract_global_metadata(orte_sstore_base_global_snapshot_info_t *global_snapshot);
+    ORTE_DECLSPEC int orte_sstore_base_get_all_snapshots(opal_list_t *all_snapshots, char *basedir);
+    ORTE_DECLSPEC int orte_sstore_base_find_largest_seq_num(orte_sstore_base_global_snapshot_info_t *global_snapshot, int *seq_num);
+    ORTE_DECLSPEC int orte_sstore_base_find_all_seq_nums(orte_sstore_base_global_snapshot_info_t *global_snapshot, int *num_seq, char ***seq_list);
+
+/*
+ * Common Tool functionality for interfacing with orte-restart/checkpoint
+ */
+ORTE_DECLSPEC int orte_sstore_base_tool_request_restart_handle(orte_sstore_base_handle_t *handle,
+                                                               char *basedir, char *ref, int seq,
+                                                               orte_sstore_base_global_snapshot_info_t *snapshot);
+ORTE_DECLSPEC int orte_sstore_base_tool_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value);
+
+#endif /* ORTE_DISABLE_FULL_SUPPORT */
+
+END_C_DECLS
+
+#endif /* ORTE_SSTORE_BASE_H */
diff --git a/orte/mca/sstore/base/help-orte-sstore-base.txt b/orte/mca/sstore/base/help-orte-sstore-base.txt
new file mode 100644
index 0000000000..7cd3616cbb
--- /dev/null
+++ b/orte/mca/sstore/base/help-orte-sstore-base.txt
@@ -0,0 +1,13 @@
+ -*- text -*-
+#
+# Copyright (c) 2010 The Trustees of Indiana University and Indiana
+# University Research and Technology
+# Corporation. All rights reserved.
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+# This is the US/English general help file for ORTE SStore framework.
+#
diff --git a/orte/mca/sstore/base/sstore_base_close.c b/orte/mca/sstore/base/sstore_base_close.c
new file mode 100644
index 0000000000..0130483c7e
--- /dev/null
+++ b/orte/mca/sstore/base/sstore_base_close.c
@@ -0,0 +1,35 @@
+/*
+ * Copyright (c) 2010 The Trustees of Indiana University.
+ * All rights reserved.
+ * $COPYRIGHT$
+ *
+ * Additional copyrights may follow
+ *
+ * $HEADER$
+ */
+
+#include "orte_config.h"
+#include "orte/constants.h"
+
+#include "opal/mca/mca.h"
+#include "opal/mca/base/base.h"
+
+#include "opal/mca/base/mca_base_param.h"
+
+#include "orte/mca/sstore/sstore.h"
+#include "orte/mca/sstore/base/base.h"
+
+int orte_sstore_base_close(void)
+{
+    /* Close the selected component */
+    if( NULL != orte_sstore.sstore_finalize ) {
+        orte_sstore.sstore_finalize();
+    }
+
+    /* Close all available modules that are open */
+    mca_base_components_close(orte_sstore_base_output,
+                              &orte_sstore_base_components_available,
+                              NULL);
+
+    return ORTE_SUCCESS;
+}
diff --git a/orte/mca/sstore/base/sstore_base_fns.c b/orte/mca/sstore/base/sstore_base_fns.c
new file mode 100644
index 0000000000..fad2314d88
--- /dev/null
+++ b/orte/mca/sstore/base/sstore_base_fns.c
@@ -0,0 +1,956 @@
+/*
+ * Copyright (c) 2010 The Trustees of Indiana University.
+ * All rights reserved.
+ * $COPYRIGHT$
+ *
+ * Additional copyrights may follow
+ *
+ * $HEADER$
+ */
+
+#include "orte_config.h"
+
+#ifdef HAVE_STRING_H
+#include <string.h>
+#endif
+#ifdef HAVE_SYS_TYPES_H
+#include <sys/types.h>
+#endif
+#ifdef HAVE_UNISTD_H
+#include <unistd.h>
+#endif
+#ifdef HAVE_SYS_STAT_H
+#include <sys/stat.h>
+#endif /* HAVE_SYS_STAT_H */
+#ifdef HAVE_DIRENT_H
+#include <dirent.h>
+#endif /* HAVE_DIRENT_H */
+#include
+
+#include "orte/constants.h"
+
+#include "opal/mca/mca.h"
+#include "opal/mca/base/base.h"
+#include "opal/util/argv.h"
+#include "opal/mca/base/mca_base_param.h"
+
+#include "orte/mca/rml/rml.h"
+#include "orte/mca/rml/rml_types.h"
+#include "orte/mca/errmgr/errmgr.h"
+#include "orte/runtime/orte_globals.h"
+#include "orte/util/proc_info.h"
+
+#include "orte/mca/sstore/sstore.h"
+#include "orte/mca/sstore/base/base.h"
+
+/******************
+ * Local Functions
+ ******************/
+
+/******************
+ * Object Stuff
+ ******************/
+OBJ_CLASS_INSTANCE(orte_sstore_base_local_snapshot_info_t,
+                   opal_list_item_t,
+                   orte_sstore_base_local_snapshot_info_construct,
+                   orte_sstore_base_local_snapshot_info_destruct);
+
+void orte_sstore_base_local_snapshot_info_construct(orte_sstore_base_local_snapshot_info_t *snapshot)
+{
+    snapshot->process_name.jobid = 0;
+    snapshot->process_name.vpid = 0;
+
+    snapshot->crs_comp = NULL;
+    snapshot->compress_comp = NULL;
+    snapshot->compress_postfix = NULL;
+
+    snapshot->start_time = NULL;
+    snapshot->end_time = NULL;
+}
+
+void orte_sstore_base_local_snapshot_info_destruct( orte_sstore_base_local_snapshot_info_t *snapshot)
+{
+    snapshot->process_name.jobid = 0;
+    snapshot->process_name.vpid = 0;
+
+    if( NULL != snapshot->crs_comp ) {
+        free(snapshot->crs_comp);
+        snapshot->crs_comp = NULL;
+    }
+
+    if( NULL != snapshot->compress_comp ) {
+        free(snapshot->compress_comp);
+        snapshot->compress_comp = NULL;
+    }
+
+    if( NULL != snapshot->compress_postfix ) {
+        free(snapshot->compress_postfix);
+        snapshot->compress_postfix = NULL;
+    }
+
+    if( NULL != snapshot->start_time ) {
+        free(snapshot->start_time);
+        snapshot->start_time = NULL;
+    }
+
+    if( NULL != snapshot->end_time ) {
+        free(snapshot->end_time);
+        snapshot->end_time = NULL;
+    }
+}
+
+OBJ_CLASS_INSTANCE(orte_sstore_base_global_snapshot_info_t,
+                   opal_list_item_t,
+                   orte_sstore_base_global_snapshot_info_construct,
+                   orte_sstore_base_global_snapshot_info_destruct);
+
+void orte_sstore_base_global_snapshot_info_construct(orte_sstore_base_global_snapshot_info_t *snapshot)
+{
+    OBJ_CONSTRUCT(&(snapshot->local_snapshots), opal_list_t);
+
+    snapshot->ss_handle = ORTE_SSTORE_HANDLE_INVALID;
+
+    snapshot->start_time = NULL;
+    snapshot->end_time = NULL;
+
+    snapshot->seq_num = -1;
+
+    snapshot->num_seqs = 0;
+    snapshot->all_seqs = NULL;
+    snapshot->basedir = NULL;
+    snapshot->reference = NULL;
+    snapshot->amca_param = NULL;
+    snapshot->metadata_filename = NULL;
+}
+
+void orte_sstore_base_global_snapshot_info_destruct( orte_sstore_base_global_snapshot_info_t *snapshot)
+{
+    opal_list_item_t* item = NULL;
+ + while (NULL != (item = opal_list_remove_first(&snapshot->local_snapshots))) { + OBJ_RELEASE(item); + } + OBJ_DESTRUCT(&(snapshot->local_snapshots)); + + snapshot->ss_handle = ORTE_SSTORE_HANDLE_INVALID; + + if( NULL != snapshot->start_time ) { + free(snapshot->start_time); + snapshot->start_time = NULL; + } + + if( NULL != snapshot->end_time ) { + free(snapshot->end_time); + snapshot->end_time = NULL; + } + + snapshot->seq_num = -1; + + snapshot->num_seqs = 0; + + if( NULL != snapshot->all_seqs ) { + opal_argv_free(snapshot->all_seqs); + snapshot->all_seqs = NULL; + } + + if( NULL != snapshot->basedir ) { + free(snapshot->basedir); + snapshot->basedir = NULL; + } + + if( NULL != snapshot->reference ) { + free(snapshot->reference); + snapshot->reference = NULL; + } + + if( NULL != snapshot->amca_param ) { + free(snapshot->amca_param); + snapshot->amca_param = NULL; + } + + if( NULL != snapshot->metadata_filename ) { + free(snapshot->metadata_filename); + snapshot->metadata_filename = NULL; + } +} + +/*************** + * Tool interface functionality + ***************/ +static orte_sstore_base_global_snapshot_info_t *tool_global_snapshot = NULL; + +int orte_sstore_base_tool_request_restart_handle(orte_sstore_base_handle_t *handle, + char *basedir, char *ref, int seq, + orte_sstore_base_global_snapshot_info_t *snapshot) +{ + int ret, exit_status = ORTE_SUCCESS; + char * tmp_str = NULL; + + if( NULL != tool_global_snapshot ) { + OBJ_RELEASE(tool_global_snapshot); + } + tool_global_snapshot = snapshot; + OBJ_RETAIN(tool_global_snapshot); + + snapshot->reference = strdup(ref); + if( NULL == basedir ) { + snapshot->basedir = strdup(orte_sstore_base_global_snapshot_dir); + } else { + snapshot->basedir = strdup(basedir); + } + asprintf(&(snapshot->metadata_filename), + "%s/%s/%s", + snapshot->basedir, + snapshot->reference, + orte_sstore_base_global_metadata_filename); + + /* + * Check the checkpoint location + */ + asprintf(&tmp_str, "%s/%s", + snapshot->basedir, + 
snapshot->reference); + if (0 > (ret = access(tmp_str, F_OK)) ) { + opal_output(0, ("Error: The snapshot requested does not exist!\n" + "Check the path (%s)!"), + tmp_str); + exit_status = ORTE_ERROR; + goto cleanup; + } + if(NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + /* + * If we were asked to find the largest seq num + */ + if( seq < 0 ) { + if( ORTE_SUCCESS != (ret = orte_sstore_base_find_largest_seq_num(snapshot, &seq)) ) { + opal_output(0, ("Error: Failed to find a valid sequence number in snapshot metadata!\n" + "Check the metadata file (%s)!"), + snapshot->metadata_filename); + exit_status = ORTE_ERROR; + goto cleanup; + } + snapshot->seq_num = seq; + } else { + snapshot->seq_num = seq; + } + + /* + * Check the checkpoint sequence location + */ + asprintf(&tmp_str, "%s/%s/%d", + snapshot->basedir, + snapshot->reference, + snapshot->seq_num); + if (0 > (ret = access(tmp_str, F_OK)) ) { + opal_output(0, ("Error: The snapshot sequence requested does not exist!\n" + "Check the path (%s)!"), + tmp_str); + exit_status = ORTE_ERROR; + goto cleanup; + } + if(NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + /* + * Build the list of processes attached to the snapshot + */ + if( ORTE_SUCCESS != (ret = orte_sstore_base_extract_global_metadata(snapshot)) ) { + opal_output(0, "Error: Failed to extract process information! 
Check the metadata file (%s)!", + snapshot->metadata_filename); + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* + * Save some basic information + */ + snapshot->ss_handle = 1; + *handle = 1; + + cleanup: + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + return exit_status; +} + +int orte_sstore_base_tool_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value) +{ + int ret, exit_status = ORTE_SUCCESS; + + if( SSTORE_METADATA_GLOBAL_SNAP_LOC_ABS == key ) { + asprintf(value, "%s/%s", + tool_global_snapshot->basedir, + tool_global_snapshot->reference); + } + else if( SSTORE_METADATA_LOCAL_SNAP_REF_FMT == key ) { + *value = strdup(orte_sstore_base_local_snapshot_fmt); + } + else if( SSTORE_METADATA_LOCAL_SNAP_LOC == key ) { + asprintf(value, "%s/%s/%d", + tool_global_snapshot->basedir, + tool_global_snapshot->reference, + tool_global_snapshot->seq_num); + } + else if( SSTORE_METADATA_LOCAL_SNAP_REF_LOC_FMT == key ) { + asprintf(value, "%s/%s/%d/%s", + tool_global_snapshot->basedir, + tool_global_snapshot->reference, + tool_global_snapshot->seq_num, + orte_sstore_base_local_snapshot_fmt); + } + else if( SSTORE_METADATA_GLOBAL_SNAP_NUM_SEQ == key ) { + if( NULL == tool_global_snapshot->all_seqs ) { + if( ORTE_SUCCESS != (ret = orte_sstore_base_find_all_seq_nums(tool_global_snapshot, + &(tool_global_snapshot->num_seqs), + &(tool_global_snapshot->all_seqs)))) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + } + asprintf(value, "%d", tool_global_snapshot->num_seqs); + } + else if( SSTORE_METADATA_GLOBAL_SNAP_ALL_SEQ == key ) { + if( NULL == tool_global_snapshot->all_seqs ) { + if( ORTE_SUCCESS != (ret = orte_sstore_base_find_all_seq_nums(tool_global_snapshot, + &(tool_global_snapshot->num_seqs), + &(tool_global_snapshot->all_seqs)))) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + } + *value = opal_argv_join(tool_global_snapshot->all_seqs, ','); + } + else if(
SSTORE_METADATA_GLOBAL_AMCA_PARAM == key ) { + *value = strdup(tool_global_snapshot->amca_param); + } + else { + return ORTE_ERR_NOT_SUPPORTED; + } + + cleanup: + return exit_status; +} + +/******************** + * Utility functions + ********************/ +int orte_sstore_base_get_global_snapshot_ref(char **name_str, pid_t pid) +{ + if( NULL == orte_sstore_base_global_snapshot_ref ) { + asprintf(name_str, "ompi_global_snapshot_%d.ckpt", pid); + } + else { + *name_str = strdup(orte_sstore_base_global_snapshot_ref); + } + + return ORTE_SUCCESS; +} + +int orte_sstore_base_convert_key_to_string(orte_sstore_base_key_t key, char **key_str) +{ + switch(key) { + case SSTORE_METADATA_LOCAL_CRS_COMP: + *key_str = strdup(SSTORE_METADATA_LOCAL_CRS_COMP_STR); + break; + case SSTORE_METADATA_LOCAL_COMPRESS_COMP: + *key_str = strdup(SSTORE_METADATA_LOCAL_COMPRESS_COMP_STR); + break; + case SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX: + *key_str = strdup(SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX_STR); + break; + case SSTORE_METADATA_LOCAL_PID: + *key_str = strdup(SSTORE_METADATA_LOCAL_PID_STR); + break; + case SSTORE_METADATA_LOCAL_CONTEXT: + *key_str = strdup(SSTORE_METADATA_LOCAL_CONTEXT_STR); + break; + case SSTORE_METADATA_LOCAL_MKDIR: + *key_str = strdup(SSTORE_METADATA_LOCAL_MKDIR_STR); + break; + case SSTORE_METADATA_LOCAL_TOUCH: + *key_str = strdup(SSTORE_METADATA_LOCAL_TOUCH_STR); + break; + case SSTORE_METADATA_LOCAL_SNAP_REF: + *key_str = NULL; + break; + case SSTORE_METADATA_LOCAL_SNAP_REF_FMT: + *key_str = strdup(SSTORE_METADATA_LOCAL_SNAP_REF_FMT_STR); + break; + case SSTORE_METADATA_LOCAL_SNAP_LOC: + *key_str = NULL; + break; + case SSTORE_METADATA_LOCAL_SNAP_META: + *key_str = NULL; + break; + case SSTORE_METADATA_GLOBAL_SNAP_REF: + *key_str = NULL; + break; + case SSTORE_METADATA_GLOBAL_SNAP_LOC: + *key_str = NULL; + break; + case SSTORE_METADATA_GLOBAL_SNAP_LOC_ABS: + *key_str = NULL; + break; + case SSTORE_METADATA_GLOBAL_SNAP_META: + *key_str = NULL; + break; + case 
SSTORE_METADATA_GLOBAL_SNAP_SEQ: + *key_str = strdup(SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR); + break; + case SSTORE_METADATA_GLOBAL_AMCA_PARAM: + *key_str = strdup(SSTORE_METADATA_GLOBAL_AMCA_PARAM_STR); + break; + default: + *key_str = NULL; + break; + } + + return ORTE_SUCCESS; +} + +int orte_sstore_base_convert_string_to_key(char *key_str, orte_sstore_base_key_t *key) +{ + if( 0 == strncmp(key_str, SSTORE_METADATA_LOCAL_CRS_COMP_STR, strlen(SSTORE_METADATA_LOCAL_CRS_COMP_STR))) { + *key = SSTORE_METADATA_LOCAL_CRS_COMP; + } + else if( 0 == strncmp(key_str, SSTORE_METADATA_LOCAL_COMPRESS_COMP_STR, strlen(SSTORE_METADATA_LOCAL_COMPRESS_COMP_STR))) { + *key = SSTORE_METADATA_LOCAL_COMPRESS_COMP; + } + else if( 0 == strncmp(key_str, SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX_STR, strlen(SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX_STR))) { + *key = SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX; + } + else if( 0 == strncmp(key_str, SSTORE_METADATA_LOCAL_PID_STR, strlen(SSTORE_METADATA_LOCAL_PID_STR))) { + *key = SSTORE_METADATA_LOCAL_PID; + } + else if( 0 == strncmp(key_str, SSTORE_METADATA_LOCAL_CONTEXT_STR, strlen(SSTORE_METADATA_LOCAL_CONTEXT_STR))) { + *key = SSTORE_METADATA_LOCAL_CONTEXT; + } + else if( 0 == strncmp(key_str, SSTORE_METADATA_LOCAL_MKDIR_STR, strlen(SSTORE_METADATA_LOCAL_MKDIR_STR))) { + *key = SSTORE_METADATA_LOCAL_MKDIR; + } + else if( 0 == strncmp(key_str, SSTORE_METADATA_LOCAL_TOUCH_STR, strlen(SSTORE_METADATA_LOCAL_TOUCH_STR))) { + *key = SSTORE_METADATA_LOCAL_TOUCH; + } + else if( 0 == strncmp(key_str, SSTORE_METADATA_LOCAL_SNAP_REF_FMT_STR, strlen(SSTORE_METADATA_LOCAL_SNAP_REF_FMT_STR))) { + *key = SSTORE_METADATA_LOCAL_SNAP_REF_FMT; + } + else if( 0 == strncmp(key_str, SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR, strlen(SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR))) { + *key = SSTORE_METADATA_GLOBAL_SNAP_SEQ; + } + else if( 0 == strncmp(key_str, SSTORE_METADATA_GLOBAL_AMCA_PARAM_STR, strlen(SSTORE_METADATA_GLOBAL_AMCA_PARAM_STR))) { + *key = 
SSTORE_METADATA_GLOBAL_AMCA_PARAM; + } + else { + *key = SSTORE_METADATA_MAX; + } + + return ORTE_SUCCESS; +} + +int orte_sstore_base_get_all_snapshots(opal_list_t *all_snapshots, char *basedir) +{ +#ifndef HAVE_DIRENT_H + return ORTE_ERR_NOT_SUPPORTED; +#else + int ret, exit_status = ORTE_SUCCESS; + char *loc_basedir = NULL; + char * tmp_str = NULL, * metadata_file = NULL; + DIR *dirp = NULL; + struct dirent *dir_entp = NULL; + struct stat file_status; + orte_sstore_base_global_snapshot_info_t *global_snapshot = NULL; + + /* Sanity check */ + if( NULL == all_snapshots || + (NULL == orte_sstore_base_global_snapshot_dir && NULL == basedir)) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + if( NULL == basedir ) { + loc_basedir = strdup(orte_sstore_base_global_snapshot_dir); + } else { + loc_basedir = strdup(basedir); + } + + /* + * Get all subdirectories under the base directory + */ + dirp = opendir(loc_basedir); + while( NULL != (dir_entp = readdir(dirp))) { + /* Skip "." and ".." if they are in the list */ + if( 0 == strncmp("..", dir_entp->d_name, strlen("..") ) || + 0 == strncmp(".", dir_entp->d_name, strlen(".") ) ) { + continue; + } + + /* Add the full path */ + asprintf(&tmp_str, "%s/%s", loc_basedir, dir_entp->d_name); + if(0 != (ret = stat(tmp_str, &file_status) ) ){ + free( tmp_str); + tmp_str = NULL; + continue; + } else { + /* Is it a directory? 
*/ + if(S_ISDIR(file_status.st_mode) ) { + asprintf(&metadata_file, "%s/%s", + tmp_str, + orte_sstore_base_global_metadata_filename); + if(0 != (ret = stat(metadata_file, &file_status) ) ){ + free( tmp_str); + tmp_str = NULL; + free( metadata_file); + metadata_file = NULL; + continue; + } else { + if(S_ISREG(file_status.st_mode) ) { + global_snapshot = OBJ_NEW(orte_sstore_base_global_snapshot_info_t); + + global_snapshot->ss_handle = 1; + global_snapshot->basedir = strdup(loc_basedir); + asprintf(&(global_snapshot->reference), + "%s", + dir_entp->d_name); + asprintf(&(global_snapshot->metadata_filename), + "%s/%s/%s", + global_snapshot->basedir, + global_snapshot->reference, + orte_sstore_base_global_metadata_filename); + + opal_list_append(all_snapshots, &(global_snapshot->super)); + } + } + free( metadata_file); + metadata_file = NULL; + } + } + + free( tmp_str); + tmp_str = NULL; + } + + closedir(dirp); + + cleanup: + if( NULL != loc_basedir ) { + free(loc_basedir); + loc_basedir = NULL; + } + + if( NULL != tmp_str) { + free( tmp_str); + tmp_str = NULL; + } + + return exit_status; +#endif /* HAVE_DIRENT_H */ +} + +int orte_sstore_base_extract_global_metadata(orte_sstore_base_global_snapshot_info_t *global_snapshot) +{ + int ret, exit_status = ORTE_SUCCESS; + FILE *metadata = NULL; + char * token = NULL; + char * value = NULL; + orte_process_name_t proc; + opal_list_item_t* item = NULL; + orte_sstore_base_local_snapshot_info_t *vpid_snapshot = NULL; + + /* + * Cleanup the structure a bit, so we can refresh it below + */ + while (NULL != (item = opal_list_remove_first(&global_snapshot->local_snapshots))) { + OBJ_RELEASE(item); + } + + if( NULL != global_snapshot->start_time ) { + free( global_snapshot->start_time ); + global_snapshot->start_time = NULL; + } + + if( NULL != global_snapshot->end_time ) { + free( global_snapshot->end_time ); + global_snapshot->end_time = NULL; + } + + /* + * Open the metadata file + */ + if (NULL == (metadata = 
fopen(global_snapshot->metadata_filename, "r")) ) { + opal_output(orte_sstore_base_output, + "sstore:base:extract_global_metadata() Unable to open the file (%s)\n", + global_snapshot->metadata_filename); + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* + * Seek to the sequence number requested + */ + if( ORTE_SUCCESS != (ret = orte_sstore_base_metadata_seek_to_seq_num(metadata, global_snapshot->seq_num))) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* + * Extract each token and make the records + */ + do { + if( ORTE_SUCCESS != orte_sstore_base_metadata_read_next_token(metadata, &token, &value) ) { + break; + } + + if(0 == strncmp(token, SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR, strlen(SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR)) || + 0 == strncmp(token, SSTORE_METADATA_INTERNAL_MIG_SEQ_STR, strlen(SSTORE_METADATA_INTERNAL_MIG_SEQ_STR)) ) { + break; + } + + if( 0 == strncmp(token, SSTORE_METADATA_INTERNAL_PROCESS_STR, strlen(SSTORE_METADATA_INTERNAL_PROCESS_STR)) ) { + orte_util_convert_string_to_process_name(&proc, value); + + /* Not the first process, so append it to the list */ + if( NULL != vpid_snapshot) { + opal_list_append(&global_snapshot->local_snapshots, &(vpid_snapshot->super)); + } + + vpid_snapshot = OBJ_NEW(orte_sstore_base_local_snapshot_info_t); + vpid_snapshot->ss_handle = global_snapshot->ss_handle; + + vpid_snapshot->process_name.jobid = proc.jobid; + vpid_snapshot->process_name.vpid = proc.vpid; + } + else if(0 == strncmp(token, SSTORE_METADATA_LOCAL_CRS_COMP_STR, strlen(SSTORE_METADATA_LOCAL_CRS_COMP_STR))) { + vpid_snapshot->crs_comp = strdup(value); + } + else if(0 == strncmp(token, SSTORE_METADATA_LOCAL_COMPRESS_COMP_STR, strlen(SSTORE_METADATA_LOCAL_COMPRESS_COMP_STR))) { + vpid_snapshot->compress_comp = strdup(value); + } + else if(0 == strncmp(token, SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX_STR, strlen(SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX_STR))) { + 
vpid_snapshot->compress_postfix = strdup(value); + } + else if(0 == strncmp(token, SSTORE_METADATA_INTERNAL_TIME_STR, strlen(SSTORE_METADATA_INTERNAL_TIME_STR)) ) { + if( NULL == global_snapshot->start_time) { + global_snapshot->start_time = strdup(value); + } + else { + global_snapshot->end_time = strdup(value); + } + } + else if(0 == strncmp(token, SSTORE_METADATA_GLOBAL_AMCA_PARAM_STR, strlen(SSTORE_METADATA_GLOBAL_AMCA_PARAM_STR))) { + global_snapshot->amca_param = strdup(value); + } + } while(0 == feof(metadata) ); + + /* Append the last item */ + if( NULL != vpid_snapshot) { + opal_list_append(&global_snapshot->local_snapshots, &(vpid_snapshot->super)); + } + + cleanup: + if( NULL != metadata ) { + fclose(metadata); + metadata = NULL; + } + if( NULL != value ) { + free(value); + value = NULL; + } + if( NULL != token ) { + free(token); + token = NULL; + } + + return exit_status; +} + +int orte_sstore_base_find_largest_seq_num(orte_sstore_base_global_snapshot_info_t *global_snapshot, int *seq_num) +{ + int exit_status = ORTE_SUCCESS; + FILE *metadata = NULL; + int tmp_seq_num = -1; + + *seq_num = -1; + + /* + * Open the metadata file + */ + if (NULL == (metadata = fopen(global_snapshot->metadata_filename, "r")) ) { + opal_output(orte_sstore_base_output, + "sstore:base:find_largest_seq_num() Unable to open the file (%s)\n", + global_snapshot->metadata_filename); + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + while(0 <= (tmp_seq_num = orte_sstore_base_metadata_read_next_seq_num(metadata)) ) { + if( tmp_seq_num > *seq_num ) { + *seq_num = tmp_seq_num; + } + } + + if( *seq_num < 0 ) { + exit_status = ORTE_ERROR; + } + + cleanup: + if( NULL != metadata ) { + fclose(metadata); + metadata = NULL; + } + + return exit_status; +} + +int orte_sstore_base_find_all_seq_nums(orte_sstore_base_global_snapshot_info_t *global_snapshot, int *num_seq, char ***seq_list) +{ + int exit_status = ORTE_SUCCESS; + FILE *metadata = NULL; + int 
tmp_seq_num = -1; + char * tmp_str = NULL; + + *num_seq = 0; + *seq_list = NULL; + + /* + * Open the metadata file + */ + if (NULL == (metadata = fopen(global_snapshot->metadata_filename, "r")) ) { + opal_output(orte_sstore_base_output, + "sstore:base:find_all_seq_nums() Unable to open the file (%s)\n", + global_snapshot->metadata_filename); + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + while(0 <= (tmp_seq_num = orte_sstore_base_metadata_read_next_seq_num(metadata)) ) { + asprintf(&tmp_str, "%d", tmp_seq_num); + + if( NULL != tmp_str ) { + opal_argv_append(num_seq, seq_list, tmp_str); + free(tmp_str); + tmp_str = NULL; + } + } + + cleanup: + if( NULL != metadata ) { + fclose(metadata); + metadata = NULL; + } + + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + return exit_status; +} + + +/* + * Extract the next sequence number from the file + */ +int orte_sstore_base_metadata_read_next_seq_num(FILE *file) +{ + char *token = NULL; + char *value = NULL; + int seq_int = -1; + + do { + if( ORTE_SUCCESS != orte_sstore_base_metadata_read_next_token(file, &token, &value) ) { + seq_int = -1; + goto cleanup; + } + } while(0 != strncmp(token, SSTORE_METADATA_INTERNAL_DONE_SEQ_STR, strlen(SSTORE_METADATA_INTERNAL_DONE_SEQ_STR))); + + seq_int = atoi(value); + + cleanup: + if( NULL != token) { + free(token); + token = NULL; + } + + if( NULL != value) { + free(value); + value = NULL; + } + + return seq_int; +} + +int orte_sstore_base_metadata_seek_to_seq_num(FILE *file, int seq_num) +{ + char *token = NULL; + char *value = NULL; + int seq_int = -1; + + rewind(file); + + do { + do { + if( ORTE_SUCCESS != orte_sstore_base_metadata_read_next_token(file, &token, &value) ) { + seq_int = -1; + goto cleanup; + } + } while(0 != strncmp(token, SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR, strlen(SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR)) ); + seq_int = atoi(value); + } while(seq_num != seq_int ); + + cleanup: + if( NULL != token) { + free(token); + 
token = NULL; + } + + if( NULL != value) { + free(value); + value = NULL; + } + + if( seq_num != seq_int ) { + return ORTE_ERROR; + } else { + return ORTE_SUCCESS; + } +} + +int orte_sstore_base_metadata_read_next_token(FILE *file, char **token, char **value) +{ + int exit_status = ORTE_SUCCESS; + int max_len = 256; + char * line = NULL; + int line_len = 0; + int c = 0, s = 0, v = 0; + char *local_token = NULL; + char *local_value = NULL; + bool end_of_line = false; + + line = (char *) malloc(sizeof(char) * max_len); + + try_again: + /* + * If we are at the end of the file, then just return + */ + if(0 != feof(file) ) { + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* + * Otherwise grab the next token/value pair + */ + if (NULL == fgets(line, max_len, file) ) { + exit_status = ORTE_ERROR; + goto cleanup; + } + line_len = strlen(line); + /* Strip off the new line if it is there */ + if('\n' == line[line_len-1]) { + line[line_len-1] = '\0'; + line_len--; + end_of_line = true; + } + else { + end_of_line = false; + } + + /* Ignore lines with just '#' too */ + if(2 >= line_len) + goto try_again; + + /* + * Extract the token from the set + */ + for(c = 0; + line[c] != ':' && + c < line_len; + ++c) { + ; + } + c += 2; /* Skip past the ':' and the ' ' that follows it */ + local_token = (char *)malloc(sizeof(char) * (c + 1)); + + for(s = 0; s < c; ++s) { + local_token[s] = line[s]; + } + + local_token[s] = '\0'; + *token = strdup(local_token); + + if( NULL != local_token) { + free(local_token); + local_token = NULL; + } + + /* + * Extract the value from the set + */ + local_value = (char *)malloc(sizeof(char) * (line_len - c + 1)); + for(v = 0, s = c; + s < line_len; + ++s, ++v) { + local_value[v] = line[s]; + } + + while(!end_of_line) { + if (NULL == fgets(line, max_len, file) ) { + exit_status = ORTE_ERROR; + goto cleanup; + } + line_len = strlen(line); + /* Strip off the new line if it is there */ + if('\n' == line[line_len-1]) { + line[line_len-1] = '\0'; + line_len--; + end_of_line
= true; + } + else { + end_of_line = false; + } + + local_value = (char *)realloc(local_value, sizeof(char) * (v + line_len + 1)); + for(s = 0; + s < line_len; + ++s, ++v) { + local_value[v] = line[s]; + } + } + + local_value[v] = '\0'; + *value = strdup(local_value); + + cleanup: + if( NULL != local_token) { + free(local_token); + local_token = NULL; + } + + if( NULL != local_value) { + free(local_value); + local_value = NULL; + } + + if( NULL != line) { + free(line); + line = NULL; + } + + return exit_status; +} diff --git a/orte/mca/sstore/base/sstore_base_open.c b/orte/mca/sstore/base/sstore_base_open.c new file mode 100644 index 0000000000..f2448b7a84 --- /dev/null +++ b/orte/mca/sstore/base/sstore_base_open.c @@ -0,0 +1,232 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" + +#include "orte/constants.h" +#include "opal/mca/mca.h" +#include "opal/util/output.h" +#include "opal/mca/base/base.h" + +#include "opal/mca/base/mca_base_param.h" +#include "orte/util/proc_info.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "orte/mca/sstore/base/static-components.h" + +#if ORTE_DISABLE_FULL_SUPPORT +/* have to include a bogus function here so that + * the build system sees at least one function + * in the library + */ +int orte_sstore_base_open(void) +{ + return ORTE_SUCCESS; +} + +#else + +/* + * Globals + */ +ORTE_DECLSPEC int orte_sstore_base_output = -1; +orte_sstore_base_module_t orte_sstore = { + NULL, /* sstore_init */ + NULL, /* sstore_finalize */ + + NULL, /* request_checkpoint_handle */ + NULL, /* request_restart_handle */ + NULL, /* request_global_snapshot_data */ + NULL, /* register_handle */ + NULL, /* get_attr */ + NULL, /* set_attr */ + NULL, /* sync */ + NULL, /* remove */ + NULL, /* pack */ + NULL, /* unpack */ + NULL, /* fetch_app_deps */ + NULL /* wait_all_deps */ +};
+opal_list_t orte_sstore_base_components_available; +orte_sstore_base_component_t orte_sstore_base_selected_component; +int orte_sstore_context; + +bool orte_sstore_base_is_checkpoint_available = false; +char * orte_sstore_base_local_metadata_filename; +char * orte_sstore_base_global_metadata_filename; +char * orte_sstore_base_local_snapshot_fmt; +char * orte_sstore_base_global_snapshot_dir = NULL; +char * orte_sstore_base_global_snapshot_ref = NULL; +char * orte_sstore_base_prelaunch_location = NULL; + +orte_sstore_base_handle_t orte_sstore_handle_current; +orte_sstore_base_handle_t orte_sstore_handle_last_stable; + +/* Determine the context of this module */ +int orte_sstore_base_determine_context(void); + +/** + * Function for finding and opening either all MCA components, + * or the one that was specifically requested via a MCA parameter. + */ +int orte_sstore_base_open(void) +{ + char *str_value = NULL; + int mca_index; + + orte_sstore_handle_current = ORTE_SSTORE_HANDLE_INVALID; + orte_sstore_handle_last_stable = ORTE_SSTORE_HANDLE_INVALID; + + orte_sstore_base_local_metadata_filename = strdup("snapshot_meta.data"); + orte_sstore_base_global_metadata_filename = strdup("global_snapshot_meta.data"); + orte_sstore_base_local_snapshot_fmt = strdup("opal_snapshot_%d.ckpt"); + + orte_sstore_base_output = opal_output_open(NULL); + + /* + * Base Global Snapshot directory + */ + mca_index = mca_base_param_reg_string_name("sstore", + "base_global_snapshot_dir", + "The base directory to use when storing global snapshots", + false, false, + opal_home_directory(), + &orte_sstore_base_global_snapshot_dir); + mca_base_param_reg_syn_name(mca_index, "snapc", "base_global_snapshot_dir", true); + + /* + * User defined snapshot reference to use for this job + */ + mca_index = mca_base_param_reg_string_name("sstore", + "base_global_snapshot_ref", + "The global snapshot reference to be used for this job. 
" + " [Default = ompi_global_snapshot_MPIRUNPID.ckpt]", + false, false, + NULL, + &orte_sstore_base_global_snapshot_ref); + mca_base_param_reg_syn_name(mca_index, "snapc", "base_global_snapshot_ref", true); + + /* + * Old, dead parameters + */ +#if 0 + /* (Should just choose to use the 'central' component instead.) + * Store the checkpoint files in their final location. + * This assumes that the storage place is on a shared file + * system that all nodes can access uniformly. + * Default = enabled + */ + mca_base_param_reg_int_name("sstore", + "base_store_in_place", + "If global_snapshot_dir is on a shared file system all nodes can access, " + "then the checkpoint files can be stored in place instead of incurring a " + "remote copy. [Default = enabled]", + false, false, + 1, + &value); +#endif + +#if 0 + OPAL_OUTPUT_VERBOSE((20, orte_sstore_base_output, + "sstore:base: open: base_global_snapshot_ref = %s", + orte_sstore_base_global_snapshot_ref)); + + /* + * Pre-establish the global snapshot directory upon job registration + */ + mca_base_param_reg_int_name("sstore_base", + "establish_global_snapshot_dir", + "Establish the global snapshot directory on job startup.
[Default = disabled]", + false, false, + 0, + &value); + orte_sstore_base_establish_global_snapshot_dir = OPAL_INT_TO_BOOL(value); + + OPAL_OUTPUT_VERBOSE((20, orte_sstore_base_output, + "sstore:base: open: base_establish_global_snapshot_dir = %d", + orte_sstore_base_establish_global_snapshot_dir)); +#endif + + /* + * Setup the prelaunch variable to point to the first possible snapshot + * location + */ + asprintf(&orte_sstore_base_prelaunch_location, + "%s/%s/%d", + orte_sstore_base_global_snapshot_dir, + orte_sstore_base_global_snapshot_ref, + 0); + + opal_output_verbose(10, orte_sstore_base_output, + "sstore:base: open()"); + opal_output_verbose(10, orte_sstore_base_output, + "sstore:base: open: Global snapshot directory = %s", + orte_sstore_base_global_snapshot_dir); + opal_output_verbose(10, orte_sstore_base_output, + "sstore:base: open: Global snapshot reference = %s", + (NULL == orte_sstore_base_global_snapshot_ref ? "Default" : orte_sstore_base_global_snapshot_ref)); + opal_output_verbose(10, orte_sstore_base_output, + "sstore:base: open: Prelaunch location = %s", + orte_sstore_base_prelaunch_location); + + /* + * Which Sstore component to open + * - NULL or "" = auto-select + * - ow. 
select that specific component + */ + mca_base_param_reg_string_name("sstore", NULL, + "Which Sstore component to use (empty = auto-select)", + false, false, + NULL, &str_value); + if( NULL != str_value ) { + free(str_value); + str_value = NULL; + } + + /* Open up all available components */ + if (OPAL_SUCCESS != + mca_base_components_open("sstore", + orte_sstore_base_output, + mca_sstore_base_static_components, + &orte_sstore_base_components_available, + true)) { + return ORTE_ERROR; + } + + orte_sstore_context = ORTE_SSTORE_UNASSIGN_TYPE; + orte_sstore_base_determine_context(); + + return ORTE_SUCCESS; +} + +int orte_sstore_base_determine_context(void) +{ + if( ORTE_PROC_IS_HNP) { + orte_sstore_context |= ORTE_SSTORE_GLOBAL_TYPE; + if( ORTE_PROC_IS_DAEMON ) { + orte_sstore_context |= ORTE_SSTORE_LOCAL_TYPE; + } + } + else if( ORTE_PROC_IS_DAEMON ) { + orte_sstore_context |= ORTE_SSTORE_LOCAL_TYPE; + } + else if( ORTE_PROC_IS_TOOL ) { + orte_sstore_context |= ORTE_SSTORE_TOOL_TYPE; + } + else if( !ORTE_PROC_IS_DAEMON ) { + orte_sstore_context |= ORTE_SSTORE_APP_TYPE; + } + + return ORTE_SUCCESS; +} + +#endif diff --git a/orte/mca/sstore/base/sstore_base_select.c b/orte/mca/sstore/base/sstore_base_select.c new file mode 100644 index 0000000000..5fd628bf9e --- /dev/null +++ b/orte/mca/sstore/base/sstore_base_select.c @@ -0,0 +1,61 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" + +#ifdef HAVE_STRING_H +#include +#endif + +#include "orte/constants.h" + +#include "opal/mca/mca.h" +#include "opal/util/output.h" +#include "opal/mca/base/base.h" + +#include "opal/mca/base/mca_base_param.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + + +int orte_sstore_base_select(void) +{ + int exit_status = OPAL_SUCCESS; + orte_sstore_base_component_t *best_component = NULL; + orte_sstore_base_module_t *best_module = NULL; + + /* + * Select the best component + */ + if( OPAL_SUCCESS != mca_base_select("sstore", orte_sstore_base_output, + &orte_sstore_base_components_available, + (mca_base_module_t **) &best_module, + (mca_base_component_t **) &best_component) ) { + /* This will only happen if no component was selected */ + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* Save the winner */ + orte_sstore_base_selected_component = *best_component; + orte_sstore = *best_module; + + /* Initialize the winner */ + if (NULL != best_module) { + if (OPAL_SUCCESS != orte_sstore.sstore_init()) { + exit_status = OPAL_ERROR; + goto cleanup; + } + } + + cleanup: + return exit_status; +} diff --git a/orte/mca/sstore/central/Makefile.am b/orte/mca/sstore/central/Makefile.am new file mode 100644 index 0000000000..2735214ec4 --- /dev/null +++ b/orte/mca/sstore/central/Makefile.am @@ -0,0 +1,40 @@ +# +# Copyright (c) 2010 The Trustees of Indiana University. +# All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +dist_pkgdata_DATA = help-orte-sstore-central.txt + +sources = \ + sstore_central.h \ + sstore_central_component.c \ + sstore_central_module.c \ + sstore_central_global.c \ + sstore_central_local.c \ + sstore_central_app.c + +# Make the output library in this directory, and name it either +# mca__.la (for DSO builds) or libmca__.la +# (for static builds). 
+ +if OMPI_BUILD_sstore_central_DSO +component_noinst = +component_install = mca_sstore_central.la +else +component_noinst = libmca_sstore_central.la +component_install = +endif + +mcacomponentdir = $(pkglibdir) +mcacomponent_LTLIBRARIES = $(component_install) +mca_sstore_central_la_SOURCES = $(sources) +mca_sstore_central_la_LDFLAGS = -module -avoid-version + +noinst_LTLIBRARIES = $(component_noinst) +libmca_sstore_central_la_SOURCES = $(sources) +libmca_sstore_central_la_LDFLAGS = -module -avoid-version diff --git a/orte/mca/sstore/central/configure.m4 b/orte/mca/sstore/central/configure.m4 new file mode 100644 index 0000000000..40e4daf7ab --- /dev/null +++ b/orte/mca/sstore/central/configure.m4 @@ -0,0 +1,20 @@ +# -*- shell-script -*- +# +# Copyright (c) 2010 The Trustees of Indiana University. +# All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +# MCA_sstore_central_CONFIG([action-if-found], [action-if-not-found]) +# ----------------------------------------------------------- +AC_DEFUN([MCA_sstore_central_CONFIG],[ + # If we don't want FT, don't compile this component + AS_IF([test "$opal_want_ft_cr" = "1"], + [$1], + [$2]) +])dnl diff --git a/orte/mca/sstore/central/configure.params b/orte/mca/sstore/central/configure.params new file mode 100644 index 0000000000..3f3b068680 --- /dev/null +++ b/orte/mca/sstore/central/configure.params @@ -0,0 +1,13 @@ +# -*- shell-script -*- +# +# Copyright (c) 2010 The Trustees of Indiana University. +# All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +PARAM_INIT_FILE=sstore_central_component.c +PARAM_CONFIG_FILES="Makefile" diff --git a/orte/mca/sstore/central/help-orte-sstore-central.txt b/orte/mca/sstore/central/help-orte-sstore-central.txt new file mode 100644 index 0000000000..08738e0765 --- /dev/null +++ b/orte/mca/sstore/central/help-orte-sstore-central.txt @@ -0,0 +1,19 @@ + -*- text -*- +# +# Copyright (c) 2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# +# This is the US/English general help file for ORTE SStore framework. +# +[fail_path_create] +Error: Failed to create the following directory. + Check to make sure this process/node can access the specified directory. +Process : %s +Node : %s +Directory: %s diff --git a/orte/mca/sstore/central/sstore_central.h b/orte/mca/sstore/central/sstore_central.h new file mode 100644 index 0000000000..47421f09a6 --- /dev/null +++ b/orte/mca/sstore/central/sstore_central.h @@ -0,0 +1,127 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/** + * @file + * + * CENTRAL SSTORE component + * + */ + +#ifndef MCA_SSTORE_CENTRAL_EXPORT_H +#define MCA_SSTORE_CENTRAL_EXPORT_H + +#include "orte_config.h" + +#include "opal/mca/mca.h" + +#include "orte/mca/sstore/sstore.h" + +BEGIN_C_DECLS + +typedef uint8_t orte_sstore_central_cmd_flag_t; +#define ORTE_SSTORE_CENTRAL_CMD OPAL_UINT8 +#define ORTE_SSTORE_CENTRAL_PULL 1 +#define ORTE_SSTORE_CENTRAL_PUSH 2 + + /* + * Local Component structures + */ + struct orte_sstore_central_component_t { + /** Base SSTORE component */ + orte_sstore_base_component_t super; + }; + typedef struct orte_sstore_central_component_t orte_sstore_central_component_t; + ORTE_MODULE_DECLSPEC extern orte_sstore_central_component_t mca_sstore_central_component; + + int orte_sstore_central_component_query(mca_base_module_t **module, int *priority); + + /* + * Module functions + */ + int orte_sstore_central_module_init(void); + int orte_sstore_central_module_finalize(void); + + int orte_sstore_central_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid); + int orte_sstore_central_request_restart_handle(orte_sstore_base_handle_t *handle, char *basedir, char *ref, int seq, + orte_sstore_base_global_snapshot_info_t *snapshot); + int orte_sstore_central_request_global_snapshot_data(orte_sstore_base_handle_t *handle, + orte_sstore_base_global_snapshot_info_t *snapshot); + int orte_sstore_central_register(orte_sstore_base_handle_t handle); + + int orte_sstore_central_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value); + int orte_sstore_central_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value); + + int orte_sstore_central_sync(orte_sstore_base_handle_t handle); + int orte_sstore_central_remove(orte_sstore_base_handle_t handle); + + int orte_sstore_central_pack(orte_process_name_t* peer, opal_buffer_t* buffer, 
orte_sstore_base_handle_t handle); + int orte_sstore_central_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle); + + int orte_sstore_central_fetch_app_deps(orte_app_context_t *app); + int orte_sstore_central_wait_all_deps(void); + + /* + * HNP functions + */ +int orte_sstore_central_global_module_init(void); +int orte_sstore_central_global_module_finalize(void); +int orte_sstore_central_global_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid); +int orte_sstore_central_global_request_global_snapshot_data(orte_sstore_base_handle_t *handle, + orte_sstore_base_global_snapshot_info_t *snapshot); +int orte_sstore_central_global_request_restart_handle(orte_sstore_base_handle_t *handle, char *basedir, + char *ref, int seq, + orte_sstore_base_global_snapshot_info_t *snapshot); +int orte_sstore_central_global_register(orte_sstore_base_handle_t handle); +int orte_sstore_central_global_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value); +int orte_sstore_central_global_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value); +int orte_sstore_central_global_sync(orte_sstore_base_handle_t handle); +int orte_sstore_central_global_remove(orte_sstore_base_handle_t handle); +int orte_sstore_central_global_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle); +int orte_sstore_central_global_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle); + + /* + * Orted functions + */ +int orte_sstore_central_local_module_init(void); +int orte_sstore_central_local_module_finalize(void); +int orte_sstore_central_local_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid); +int orte_sstore_central_local_register(orte_sstore_base_handle_t handle); +int orte_sstore_central_local_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, 
char **value); +int orte_sstore_central_local_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value); +int orte_sstore_central_local_sync(orte_sstore_base_handle_t handle); +int orte_sstore_central_local_remove(orte_sstore_base_handle_t handle); +int orte_sstore_central_local_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle); +int orte_sstore_central_local_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle); + +void orte_sstore_central_local_process_cmd(int fd, + short event, + void *cbdata); + /* + * Application functions + */ +int orte_sstore_central_app_module_init(void); +int orte_sstore_central_app_module_finalize(void); +int orte_sstore_central_app_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid); +int orte_sstore_central_app_register(orte_sstore_base_handle_t handle); +int orte_sstore_central_app_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value); +int orte_sstore_central_app_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value); +int orte_sstore_central_app_sync(orte_sstore_base_handle_t handle); +int orte_sstore_central_app_remove(orte_sstore_base_handle_t handle); +int orte_sstore_central_app_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle); +int orte_sstore_central_app_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle); + + /* + * Internal utility functions + */ + +END_C_DECLS + +#endif /* MCA_SSTORE_CENTRAL_EXPORT_H */ diff --git a/orte/mca/sstore/central/sstore_central_app.c b/orte/mca/sstore/central/sstore_central_app.c new file mode 100644 index 0000000000..2506109399 --- /dev/null +++ b/orte/mca/sstore/central/sstore_central_app.c @@ -0,0 +1,742 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + * + */ + +#include "orte_config.h" + +#ifdef HAVE_STRING_H +#include <string.h> +#endif +#include +#include +#include +#include +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif /* HAVE_UNISTD_H */ + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" + +#include "opal/event/event.h" + +#include "orte/constants.h" +#include "orte/util/show_help.h" +#include "opal/util/argv.h" +#include "opal/util/output.h" +#include "opal/util/show_help.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" +#include "opal/util/os_dirpath.h" + +#include "opal/threads/mutex.h" +#include "opal/threads/condition.h" + +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/rml/rml_types.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "sstore_central.h" + +/********** + * Object stuff + **********/ +struct orte_sstore_central_app_snapshot_info_t { + /** List super object */ + opal_list_item_t super; + + /** */ + orte_sstore_base_handle_t id; + + /** Global Sequence Number */ + int seq_num; + + /** Global Reference Name */ + char * global_ref_name; + + /** Local Location (Absolute Path) */ + char * local_location; + + /** Metadata File Name (Absolute Path) */ + char *metadata_filename; + + /** Metadata File Descriptor */ + FILE *metadata; + + /** CRS Component used */ + char * crs_comp; + + /** Did this process skip the checkpoint?
*/ + bool ckpt_skipped; +}; +typedef struct orte_sstore_central_app_snapshot_info_t orte_sstore_central_app_snapshot_info_t; +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_central_app_snapshot_info_t); + +void orte_sstore_central_app_snapshot_info_construct(orte_sstore_central_app_snapshot_info_t *info); +void orte_sstore_central_app_snapshot_info_destruct( orte_sstore_central_app_snapshot_info_t *info); + +OBJ_CLASS_INSTANCE(orte_sstore_central_app_snapshot_info_t, + opal_list_item_t, + orte_sstore_central_app_snapshot_info_construct, + orte_sstore_central_app_snapshot_info_destruct); + + +/********** + * Local Function and Variable Declarations + **********/ +static orte_sstore_central_app_snapshot_info_t *create_new_handle_info(orte_sstore_base_handle_t handle); +static orte_sstore_central_app_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle); + +static int init_local_snapshot_directory(orte_sstore_central_app_snapshot_info_t *handle_info); +static int pull_handle_info(orte_sstore_central_app_snapshot_info_t *handle_info ); +static int push_handle_info(orte_sstore_central_app_snapshot_info_t *handle_info ); + +static int metadata_open(orte_sstore_central_app_snapshot_info_t * handle_info); +static int metadata_close(orte_sstore_central_app_snapshot_info_t * handle_info); +static int metadata_write_str(orte_sstore_central_app_snapshot_info_t * handle_info, char * key, char *value); +static int metadata_write_int(orte_sstore_central_app_snapshot_info_t * handle_info, char *key, int value); +static int metadata_write_timestamp(orte_sstore_central_app_snapshot_info_t * handle_info); + +static opal_list_t *active_handles = NULL; + +/********** + * Object stuff + **********/ +void orte_sstore_central_app_snapshot_info_construct(orte_sstore_central_app_snapshot_info_t *info) +{ + info->id = 0; + + info->seq_num = -1; + + info->global_ref_name = NULL; + info->local_location = NULL; + + info->metadata_filename = NULL; + info->metadata = NULL; + + 
info->crs_comp = NULL; + + info->ckpt_skipped = false; +} + +void orte_sstore_central_app_snapshot_info_destruct( orte_sstore_central_app_snapshot_info_t *info) +{ + info->id = 0; + info->seq_num = -1; + + if( NULL != info->global_ref_name ) { + free( info->global_ref_name ); + info->global_ref_name = NULL; + } + + if( NULL != info->local_location ) { + free( info->local_location ); + info->local_location = NULL; + } + + if( NULL != info->metadata_filename ) { + free( info->metadata_filename ) ; + info->metadata_filename = NULL; + } + + if( NULL != info->metadata ) { + fclose(info->metadata); + info->metadata = NULL; + } + + if( NULL != info->crs_comp ) { + free( info->crs_comp ); + info->crs_comp = NULL; + } + + info->ckpt_skipped = false; +} + +/****************** + * Local functions + ******************/ +int orte_sstore_central_app_module_init(void) +{ + if( NULL == active_handles ) { + active_handles = OBJ_NEW(opal_list_t); + } + + return ORTE_SUCCESS; +} + +int orte_sstore_central_app_module_finalize(void) +{ + if( NULL != active_handles ) { + OBJ_RELEASE(active_handles); + } + + return ORTE_SUCCESS; +} + +int orte_sstore_central_app_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid) +{ + opal_output(0, "sstore:central:(app): request_checkpoint_handle() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_central_app_register(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_app_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(app): register(%d)", (int)handle)); + + /* + * Create a handle + */ + orte_sstore_handle_current = handle; + handle_info = create_new_handle_info(handle); + + /* + * Get basic information from Local SStore + */ + if( ORTE_SUCCESS != (ret = pull_handle_info(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* 
+ * Setup the storage directory + */ + if( ORTE_SUCCESS != (ret = init_local_snapshot_directory(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +int orte_sstore_central_app_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value) +{ + int exit_status = ORTE_SUCCESS; + orte_sstore_central_app_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(app): get_attr(%d)", key)); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Access metadata + */ + if( SSTORE_METADATA_GLOBAL_SNAP_SEQ == key ) { + asprintf(value, "%d", handle_info->seq_num); + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(app): get_attr(%d, %d) Seq = <%s>", key, handle_info->id, *value)); + } + else if( SSTORE_METADATA_LOCAL_SNAP_LOC == key) { + *value = strdup(handle_info->local_location); + } + else if( SSTORE_METADATA_LOCAL_SNAP_META == key ) { + *value = strdup(handle_info->metadata_filename); + } + else if( SSTORE_METADATA_GLOBAL_SNAP_REF == key ) { + *value = strdup(handle_info->global_ref_name); + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(app): get_attr(%d, %d) Ref = <%s>", key, handle_info->id, *value)); + } + else { + exit_status = ORTE_ERR_NOT_SUPPORTED; + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(app): get_attr(%d, %d) <%s>", key, handle_info->id, *value)); + cleanup: + return exit_status; +} + +int orte_sstore_central_app_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_app_snapshot_info_t *handle_info = NULL; + char *key_str = NULL; + + OPAL_OUTPUT_VERBOSE((10, 
mca_sstore_central_component.super.output_handle, + "sstore:central:(app): set_attr(%d = %s)", key, value)); + + if( NULL == value ) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + if( key >= SSTORE_METADATA_MAX ) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Access metadata + */ + if( SSTORE_METADATA_LOCAL_CRS_COMP == key ) { + if( NULL != handle_info->crs_comp ) { + free(handle_info->crs_comp); + } + handle_info->crs_comp = strdup(value); + } + else if(SSTORE_METADATA_LOCAL_SKIP_CKPT == key ) { + handle_info->ckpt_skipped = true; + } + else if( SSTORE_METADATA_LOCAL_MKDIR == key || + SSTORE_METADATA_LOCAL_TOUCH == key ) { + orte_sstore_base_convert_key_to_string(key, &key_str); + if( ORTE_SUCCESS != (ret = metadata_write_str(handle_info, key_str, value))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + else { + exit_status = ORTE_ERROR; + goto cleanup; + } + + cleanup: + if( NULL != key_str ) { + free(key_str); + key_str = NULL; + } + + return exit_status; +} + +int orte_sstore_central_app_sync(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_app_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(app): sync()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Finalize and close the metadata + */ + if( ORTE_SUCCESS != (ret = metadata_write_timestamp(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_close(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Push information to the Local coordinator + */ + if( ORTE_SUCCESS != (ret = push_handle_info(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = 
ret; + goto cleanup; + } + + cleanup: + orte_sstore_handle_current = ORTE_SSTORE_HANDLE_INVALID; + + return exit_status; +} + +int orte_sstore_central_app_remove(orte_sstore_base_handle_t handle) +{ + opal_output(0, "sstore:central:(app): remove() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_central_app_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle) +{ + opal_output(0, "sstore:central:(app): pack() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_central_app_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle) +{ + opal_output(0, "sstore:central:(app): unpack() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +/************************** + * Local functions + **************************/ +static orte_sstore_central_app_snapshot_info_t *create_new_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_central_app_snapshot_info_t *handle_info = NULL; + + handle_info = OBJ_NEW(orte_sstore_central_app_snapshot_info_t); + + handle_info->id = handle; + + opal_list_append(active_handles, &(handle_info->super)); + + return handle_info; +} + +static orte_sstore_central_app_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_central_app_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_central_app_snapshot_info_t*)item; + + if( handle_info->id == handle ) { + return handle_info; + } + } + + return NULL; +} + +static int pull_handle_info(orte_sstore_central_app_snapshot_info_t *handle_info ) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t buffer; + orte_sstore_central_cmd_flag_t command; + orte_std_cntr_t count; + orte_sstore_base_handle_t loc_id; + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + /* 
+ * Ask the daemon to send us the info that we need + */ + command = ORTE_SSTORE_CENTRAL_PULL; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SSTORE_CENTRAL_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, &buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Receive the response + */ + OBJ_DESTRUCT(&buffer); + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(app): pull() from %s -> %s", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + ORTE_NAME_PRINT(ORTE_PROC_MY_DAEMON))); + if( ORTE_SUCCESS != (ret = orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, + &buffer, + ORTE_RML_TAG_SSTORE_INTERNAL, + 0)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &command, &count, ORTE_SSTORE_CENTRAL_CMD))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &loc_id, &count, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + if( loc_id != handle_info->id ) { + ; /* JJH Big problem */ + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &(handle_info->seq_num), &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &(handle_info->global_ref_name), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &(handle_info->local_location), &count, 
OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &(handle_info->metadata_filename), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(app): pull() from %s -> %s (%d, %d, %s)", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + ORTE_NAME_PRINT(ORTE_PROC_MY_DAEMON), + handle_info->id, + handle_info->seq_num, + handle_info->global_ref_name + )); + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} + +static int push_handle_info(orte_sstore_central_app_snapshot_info_t *handle_info ) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t buffer; + orte_sstore_central_cmd_flag_t command; + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + command = ORTE_SSTORE_CENTRAL_PUSH; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SSTORE_CENTRAL_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->ckpt_skipped), 1, OPAL_BOOL )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( !handle_info->ckpt_skipped ) { + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->crs_comp), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, &buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} + +static int init_local_snapshot_directory(orte_sstore_central_app_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + mode_t 
my_mode = S_IRWXU; + + /* + * Make the local snapshot directory (handle_info->local_location) + */ + if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(handle_info->local_location, my_mode)) ) { + opal_show_help("help-orte-sstore-central.txt", "fail_path_create", true, + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + orte_process_info.nodename, + handle_info->local_location); + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Open up the metadata file + */ + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Add a timestamp and the PID of this process + */ + if( ORTE_SUCCESS != (ret = metadata_write_timestamp(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, SSTORE_METADATA_LOCAL_PID_STR, (int)getpid())) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_close(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + + +/************************** + * Metadata functions + **************************/ +static int metadata_open(orte_sstore_central_app_snapshot_info_t * handle_info) +{ + /* If already open, then just return */ + if( NULL != handle_info->metadata ) { + return ORTE_SUCCESS; + } + + if (NULL == (handle_info->metadata = fopen(handle_info->metadata_filename, "a")) ) { + opal_output(orte_sstore_base_output, + "sstore:central:(app):metadata_open() Unable to open the file (%s)\n", + handle_info->metadata_filename); + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + return ORTE_SUCCESS; +} + +static int metadata_close(orte_sstore_central_app_snapshot_info_t * handle_info) +{ + /* If already closed, then just return */ + if( NULL == handle_info->metadata ) { + return ORTE_SUCCESS; + } + + fclose(handle_info->metadata); + handle_info->metadata = NULL;
+ + return ORTE_SUCCESS; +} + +static int metadata_write_str(orte_sstore_central_app_snapshot_info_t * handle_info, char *key, char *value) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + fprintf(handle_info->metadata, "%s%s\n", key, value); + + cleanup: + /* Must close the metadata each time, since if we try to checkpoint the + * CRS might want to restore the FD, and will likely fail if the snapshot + * moved */ + if( NULL != handle_info->metadata ) { + fclose(handle_info->metadata); + handle_info->metadata = NULL; + } + + return exit_status; +} + +static int metadata_write_int(orte_sstore_central_app_snapshot_info_t * handle_info, char *key, int value) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + fprintf(handle_info->metadata, "%s%d\n", key, value); + + cleanup: + return exit_status; +} + +static int metadata_write_timestamp(orte_sstore_central_app_snapshot_info_t * handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + time_t timestamp; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + timestamp = time(NULL); + fprintf(handle_info->metadata, "%s%s", SSTORE_METADATA_INTERNAL_TIME_STR, ctime(&timestamp)); + + cleanup: + return exit_status; +} diff --git a/orte/mca/sstore/central/sstore_central_component.c b/orte/mca/sstore/central/sstore_central_component.c new file mode 100644 index 0000000000..808b8a2311 --- /dev/null +++ b/orte/mca/sstore/central/sstore_central_component.c @@
-0,0 +1,115 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" +#include "opal/util/output.h" +#include "orte/constants.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" +#include "sstore_central.h" + +/* + * Public string for version number + */ +const char *orte_sstore_central_component_version_string = + "ORTE SSTORE central MCA component version " ORTE_VERSION; + +/* + * Local functionality + */ +static int sstore_central_open(void); +static int sstore_central_close(void); + +/* + * Instantiate the public struct with all of our public information + * and pointers to our public functions in it + */ +orte_sstore_central_component_t mca_sstore_central_component = { + /* First do the base component stuff */ + { + /* Handle the general mca_component_t struct containing + * meta information about the component itself + */ + { + ORTE_SSTORE_BASE_VERSION_2_0_0, + /* Component name and version */ + "central", + ORTE_MAJOR_VERSION, + ORTE_MINOR_VERSION, + ORTE_RELEASE_VERSION, + + /* Component open and close functions */ + sstore_central_open, + sstore_central_close, + orte_sstore_central_component_query + }, + { + /* The component is checkpoint ready */ + MCA_BASE_METADATA_PARAM_CHECKPOINT + }, + + /* Verbosity level */ + 0, + /* opal_output handler */ + -1, + /* Default priority */ + 20 + }, +}; + +static int sstore_central_open(void) +{ + mca_base_param_reg_int(&mca_sstore_central_component.super.base_version, + "priority", + "Priority of the SSTORE central component", + false, false, + mca_sstore_central_component.super.priority, + &mca_sstore_central_component.super.priority); + + mca_base_param_reg_int(&mca_sstore_central_component.super.base_version, + "verbose", + "Verbose level for the SSTORE central component", + false, false, + mca_sstore_central_component.super.verbose, +
&mca_sstore_central_component.super.verbose); + /* If there is a custom verbose level for this component, then use it; + * otherwise take our parent's level and output channel + */ + if ( 0 != mca_sstore_central_component.super.verbose) { + mca_sstore_central_component.super.output_handle = opal_output_open(NULL); + opal_output_set_verbosity(mca_sstore_central_component.super.output_handle, + mca_sstore_central_component.super.verbose); + } else { + mca_sstore_central_component.super.output_handle = orte_sstore_base_output; + } + + /* + * Debug Output + */ + opal_output_verbose(10, mca_sstore_central_component.super.output_handle, + "sstore:central: open()"); + opal_output_verbose(20, mca_sstore_central_component.super.output_handle, + "sstore:central: open: priority = %d", + mca_sstore_central_component.super.priority); + opal_output_verbose(20, mca_sstore_central_component.super.output_handle, + "sstore:central: open: verbosity = %d", + mca_sstore_central_component.super.verbose); + + return ORTE_SUCCESS; +} + +static int sstore_central_close(void) +{ + opal_output_verbose(10, mca_sstore_central_component.super.output_handle, + "sstore:central: close()"); + + return ORTE_SUCCESS; +} diff --git a/orte/mca/sstore/central/sstore_central_global.c b/orte/mca/sstore/central/sstore_central_global.c new file mode 100644 index 0000000000..71ef8c50cd --- /dev/null +++ b/orte/mca/sstore/central/sstore_central_global.c @@ -0,0 +1,1243 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved.
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + * + */ + +#include "orte_config.h" + +#ifdef HAVE_STRING_H +#include <string.h> +#endif +#include +#include +#include +#include +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif /* HAVE_UNISTD_H */ + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" + +#include "opal/event/event.h" + +#include "orte/constants.h" +#include "orte/util/show_help.h" +#include "opal/util/argv.h" +#include "opal/util/output.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" +#include "opal/util/os_dirpath.h" + +#include "opal/threads/mutex.h" +#include "opal/threads/condition.h" + +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/snapc/snapc.h" +#include "orte/mca/snapc/base/base.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "sstore_central.h" + +#define SSTORE_HANDLE_TYPE_NONE 0 +#define SSTORE_HANDLE_TYPE_CKPT 1 +#define SSTORE_HANDLE_TYPE_RESTART 2 + +#define SSTORE_GLOBAL_NONE 0 +#define SSTORE_GLOBAL_ERROR 1 +#define SSTORE_GLOBAL_INIT 2 +#define SSTORE_GLOBAL_REG 3 +#define SSTORE_GLOBAL_SYNCING 4 +#define SSTORE_GLOBAL_SYNCED 5 + +/********** + * Object Stuff + **********/ +struct orte_sstore_central_global_snapshot_info_t { + /** List super object */ + opal_list_item_t super; + + /** */ + orte_sstore_base_handle_t id; + + /** Job ID */ + orte_jobid_t jobid; + + /** State */ + int state; + + /** Handle type */ + int handle_type; + + /** Sequence Number */ + int seq_num; + + /** Reference Name */ + char * ref_name; + + /** Local Location (Relative Path to base_location) */ + char * local_location; + + /** Application location format */ + char * app_location_fmt; + + /** Base location */ + char * base_location; +
+ /** Metadata File Name */ + char *metadata_filename; + + /** Metadata File Descriptor */ + FILE *metadata; + + /** Num procs in job */ + int num_procs_total; + + /** Num procs synced */ + int num_procs_synced; + + /** Is this checkpoint representing a migration? */ + bool migrating; +}; +typedef struct orte_sstore_central_global_snapshot_info_t orte_sstore_central_global_snapshot_info_t; +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_central_global_snapshot_info_t); + +void orte_sstore_central_global_snapshot_info_construct(orte_sstore_central_global_snapshot_info_t *info); +void orte_sstore_central_global_snapshot_info_destruct( orte_sstore_central_global_snapshot_info_t *info); + +OBJ_CLASS_INSTANCE(orte_sstore_central_global_snapshot_info_t, + opal_list_item_t, + orte_sstore_central_global_snapshot_info_construct, + orte_sstore_central_global_snapshot_info_destruct); + + +/********** + * Local Function and Variable Declarations + **********/ +static bool is_global_listener_active = false; +static int sstore_central_global_start_listener(void); +static int sstore_central_global_stop_listener(void); +static void sstore_central_global_recv(int status, + orte_process_name_t* sender, + opal_buffer_t* buffer, + orte_rml_tag_t tag, + void* cbdata); +static void sstore_central_global_process_cmd(int fd, + short event, + void *cbdata); +static int process_local_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_global_snapshot_info_t *handle_info); +static int process_local_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_global_snapshot_info_t *handle_info); + +static orte_sstore_central_global_snapshot_info_t *create_new_handle_info(int seq, int type, orte_jobid_t jobid); +static orte_sstore_central_global_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle); +static orte_sstore_central_global_snapshot_info_t *find_handle_info_from_ref(char *ref, int seq); + +static int 
metadata_open(orte_sstore_central_global_snapshot_info_t * handle_info); +static int metadata_close(orte_sstore_central_global_snapshot_info_t * handle_info); +static int metadata_write_int(orte_sstore_central_global_snapshot_info_t * handle_info, char * key, int value); +static int metadata_write_str(orte_sstore_central_global_snapshot_info_t * handle_info, char * key, char *value); +static int metadata_write_timestamp(orte_sstore_central_global_snapshot_info_t * handle_info); + +static int init_global_snapshot_directory(orte_sstore_central_global_snapshot_info_t *handle_info); +static int central_snapshot_sort_compare_fn(opal_list_item_t **a, + opal_list_item_t **b); +static int orte_sstore_central_extract_global_metadata(orte_sstore_central_global_snapshot_info_t * handle_info, + orte_sstore_base_global_snapshot_info_t *global_snapshot); + +static int next_handle_id = 1; + +static opal_list_t *active_handles = NULL; + +/********** + * Object stuff + **********/ +void orte_sstore_central_global_snapshot_info_construct(orte_sstore_central_global_snapshot_info_t *info) +{ + info->id = next_handle_id; + next_handle_id++; + + info->jobid = ORTE_JOBID_INVALID; + + info->state = SSTORE_GLOBAL_NONE; + + info->handle_type = SSTORE_HANDLE_TYPE_NONE; + + info->seq_num = -1; + + info->base_location = strdup(orte_sstore_base_global_snapshot_dir); + + info->ref_name = NULL; + info->local_location = NULL; + info->app_location_fmt = NULL; + + info->metadata_filename = NULL; + info->metadata = NULL; + + info->num_procs_total = 0; + info->num_procs_synced = 0; + + info->migrating = false; +} + +void orte_sstore_central_global_snapshot_info_destruct( orte_sstore_central_global_snapshot_info_t *info) +{ + info->id = 0; + info->seq_num = -1; + + info->jobid = ORTE_JOBID_INVALID; + + info->state = SSTORE_GLOBAL_NONE; + + info->handle_type = SSTORE_HANDLE_TYPE_NONE; + + if( NULL != info->ref_name ) { + free( info->ref_name ); + info->ref_name = NULL; + } + + if( NULL != 
info->local_location ) { + free( info->local_location ); + info->local_location = NULL; + } + + if( NULL != info->app_location_fmt ) { + free( info->app_location_fmt ); + info->app_location_fmt = NULL; + } + + if( NULL != info->base_location ) { + free( info->base_location ); + info->base_location = NULL; + } + + if( NULL != info->metadata_filename ) { + free( info->metadata_filename ) ; + info->metadata_filename = NULL; + } + + if( NULL != info->metadata ) { + fclose(info->metadata); + info->metadata = NULL; + } + + info->num_procs_total = 0; + info->num_procs_synced = 0; + + info->migrating = false; +} + +/****************** + * Local functions + ******************/ +int orte_sstore_central_global_module_init(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): init()")); + + if( NULL == active_handles ) { + active_handles = OBJ_NEW(opal_list_t); + } + + /* + * Setup a listener for the HNP/Apps + */ + if( ORTE_SUCCESS != (ret = sstore_central_global_start_listener()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + exit_status = orte_sstore_central_local_module_init(); + + cleanup: + return exit_status; +} + +int orte_sstore_central_global_module_finalize(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): finalize()")); + + exit_status = orte_sstore_central_local_module_finalize(); + + if( NULL != active_handles ) { + OBJ_RELEASE(active_handles); + } + + /* + * Shutdown the listener for the HNP/Apps + */ + if( ORTE_SUCCESS != (ret = sstore_central_global_stop_listener()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +int orte_sstore_central_global_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid) +{ + int ret, exit_status = ORTE_SUCCESS; 
+ orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): request_checkpoint_handle()")); + + /* + * Construct a handle + * - Associate all of the necessary information + */ + handle_info = create_new_handle_info(seq, SSTORE_HANDLE_TYPE_CKPT, jobid); + + /* + * Create the global checkpoint directory + */ + if( ORTE_SUCCESS != (ret = init_global_snapshot_directory(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Return the handle + */ + *handle = handle_info->id; + + cleanup: + return exit_status; +} + +int orte_sstore_central_global_request_restart_handle(orte_sstore_base_handle_t *handle, char *basedir, + char *ref, int seq, + orte_sstore_base_global_snapshot_info_t *snapshot) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + + handle_info = find_handle_info_from_ref(ref, seq); + if( NULL == handle_info ) { + ret = ORTE_ERROR; + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + *handle = handle_info->id; + + cleanup: + return exit_status; +} + +int orte_sstore_central_global_request_global_snapshot_data(orte_sstore_base_handle_t *handle, + orte_sstore_base_global_snapshot_info_t *snapshot) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): request_global_snapshot_data()")); + + /* + * Lookup the handle (if NULL, use last stable) + */ + if( NULL != handle ) { + handle_info = find_handle_info(*handle); + snapshot->ss_handle = *handle; + } else { + handle_info = find_handle_info(orte_sstore_handle_last_stable); + snapshot->ss_handle = orte_sstore_handle_last_stable; + } + + /* + * Construct the snapshot from local data, and metadata file + */ + snapshot->seq_num = 
handle_info->seq_num; + snapshot->reference = strdup(handle_info->ref_name); + snapshot->basedir = strdup(handle_info->base_location); + snapshot->metadata_filename = strdup(handle_info->metadata_filename); + + /* If this is the current checkpoint, pull data from local cache */ + if( orte_sstore_handle_current == snapshot->ss_handle ) { + if( ORTE_SUCCESS != (ret = orte_sstore_central_extract_global_metadata(handle_info, snapshot)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + /* Otherwise, pull from metadata */ + else { + if( ORTE_SUCCESS != (ret = orte_sstore_base_extract_global_metadata(snapshot)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + opal_list_sort(&snapshot->local_snapshots, central_snapshot_sort_compare_fn); + + cleanup: + return exit_status; +} + +int orte_sstore_central_global_register(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): register(%d) - Global", handle)); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + if( SSTORE_GLOBAL_REG != handle_info->state ) { + handle_info->state = SSTORE_GLOBAL_REG; + } else { + return orte_sstore_central_local_register(handle); + } + + orte_sstore_handle_current = handle; + + /* + * Associate the metadata + */ + if( handle_info->migrating ) { + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, + SSTORE_METADATA_INTERNAL_MIG_SEQ_STR, + handle_info->seq_num)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } else { + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, + SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR, + handle_info->seq_num)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + if( ORTE_SUCCESS != (ret = metadata_write_str(handle_info, + 
SSTORE_METADATA_LOCAL_SNAP_REF_FMT_STR, + orte_sstore_base_local_snapshot_fmt)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_write_timestamp(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +int orte_sstore_central_global_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value) +{ + int exit_status = ORTE_SUCCESS; + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): get_attr()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Access metadata + */ + if( SSTORE_METADATA_GLOBAL_SNAP_REF == key ) { + *value = strdup(handle_info->ref_name); + } + else if( SSTORE_METADATA_GLOBAL_SNAP_SEQ == key ) { + asprintf(value, "%d", handle_info->seq_num); + } + else if( SSTORE_METADATA_LOCAL_SNAP_REF_FMT == key ) { + *value = strdup(orte_sstore_base_local_snapshot_fmt); + } + /* 'central' does not cache, so these are the same */ + else if( SSTORE_METADATA_LOCAL_SNAP_LOC == key ) { + asprintf(value, "%s/%s/%d", + handle_info->base_location, + handle_info->ref_name, + handle_info->seq_num); + } + else if( SSTORE_METADATA_LOCAL_SNAP_REF_LOC_FMT == key ) { + asprintf(value, "%s/%s/%d/%s", + handle_info->base_location, + handle_info->ref_name, + handle_info->seq_num, + orte_sstore_base_local_snapshot_fmt); + } + else { + exit_status = ORTE_ERR_NOT_SUPPORTED; + } + + return exit_status; +} + +int orte_sstore_central_global_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + char *key_str = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): set_attr()")); + + /* + * 
Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Process key (Access metadata) + */ + if( key == SSTORE_METADATA_GLOBAL_MIGRATING ) { + handle_info->migrating = true; + } + else { + orte_sstore_base_convert_key_to_string(key, &key_str); + if( NULL == key_str ) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_write_str(handle_info, key_str, value))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + cleanup: + if( NULL != key_str ) { + free(key_str); + key_str = NULL; + } + + return exit_status; +} + +int orte_sstore_central_global_sync(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): sync()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + if( SSTORE_GLOBAL_SYNCING != handle_info->state ) { + handle_info->state = SSTORE_GLOBAL_SYNCING; + if( ORTE_SNAPC_LOCAL_COORD_TYPE == (orte_snapc_coord_type & ORTE_SNAPC_LOCAL_COORD_TYPE) ) { + return orte_sstore_central_local_sync(handle); + } + } + + /* + * Synchronize all of the files + */ + while(handle_info->num_procs_synced < handle_info->num_procs_total) { + opal_progress(); + } + + /* + * Finalize and close the metadata + */ + if( ORTE_SUCCESS != (ret = metadata_write_timestamp(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( handle_info->migrating ) { + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, + SSTORE_METADATA_INTERNAL_DONE_MIG_SEQ_STR, + handle_info->seq_num)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } else { + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, + SSTORE_METADATA_INTERNAL_DONE_SEQ_STR, + handle_info->seq_num)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + 
goto cleanup; + } + } + + if( ORTE_SUCCESS != (ret = metadata_close(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* JJH: We should lock this var! */ + if( !handle_info->migrating ) { + orte_sstore_base_is_checkpoint_available = true; + orte_sstore_handle_last_stable = orte_sstore_handle_current; + } + + cleanup: + return exit_status; +} + +int orte_sstore_central_global_remove(orte_sstore_base_handle_t handle) +{ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): remove()")); + + /* + * Lookup the handle + */ + + return ORTE_SUCCESS; +} + +int orte_sstore_central_global_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): pack()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Pack the handle ID + */ + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &handle, 1, ORTE_SSTORE_HANDLE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): pack(%d, %d, %s)", + handle_info->id, + handle_info->seq_num, + handle_info->ref_name)); + + /* + * Pack any metadata + */ + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(handle_info->seq_num), 1, OPAL_INT )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(handle_info->ref_name), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(handle_info->app_location_fmt), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return 
exit_status; +} + +int orte_sstore_central_global_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle) +{ + int ret, exit_status = ORTE_SUCCESS; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): unpack()")); + + /* + * Unpack the handle id + */ + if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_JOBID, + ORTE_PROC_MY_NAME, + peer)) { + /* + * Defer to the orted version so that any local application processes are updated too + */ + if( ORTE_SUCCESS != (ret = orte_sstore_central_local_unpack(peer, buffer, handle)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + cleanup: + return exit_status; +} + +/************************** + * Local functions + **************************/ +static orte_sstore_central_global_snapshot_info_t *create_new_handle_info(int seq, int type, orte_jobid_t jobid) +{ + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + orte_job_t *jdata = NULL; + + handle_info = OBJ_NEW(orte_sstore_central_global_snapshot_info_t); + + handle_info->jobid = jobid; + + handle_info->state = SSTORE_GLOBAL_INIT; + + handle_info->handle_type = type; + + handle_info->seq_num = seq; + + orte_sstore_base_get_global_snapshot_ref(&(handle_info->ref_name), getpid()); + + asprintf(&(handle_info->local_location), "%s/%d", + handle_info->ref_name, handle_info->seq_num); + + asprintf(&(handle_info->app_location_fmt), "%s/%s/%s", + handle_info->base_location, + handle_info->local_location, + orte_sstore_base_local_snapshot_fmt); + + asprintf(&(handle_info->metadata_filename), "%s/%s/%s", + handle_info->base_location, + handle_info->ref_name, + orte_sstore_base_global_metadata_filename); + + jdata = orte_get_job_data_object(handle_info->jobid); + handle_info->num_procs_total = (int)jdata->num_procs; + + opal_list_append(active_handles, &(handle_info->super)); + + return handle_info; +} + +static orte_sstore_central_global_snapshot_info_t
*find_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_central_global_snapshot_info_t*)item; + + if( handle_info->id == handle ) { + return handle_info; + } + } + + return NULL; +} + +static orte_sstore_central_global_snapshot_info_t *find_handle_info_from_ref(char *ref, int seq) +{ + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_central_global_snapshot_info_t*)item; + + if( handle_info->seq_num == seq ) { + if( NULL != ref && + 0 == strncmp(handle_info->ref_name, ref, strlen(ref)) ) { + return handle_info; + } else if( NULL == ref ) { + return handle_info; + } + } + } + + return NULL; +} + +static int sstore_central_global_start_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if( is_global_listener_active ) { + return ORTE_SUCCESS; + } + + if (ORTE_SUCCESS != (ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_SSTORE_INTERNAL, + ORTE_RML_PERSISTENT, + sstore_central_global_recv, + NULL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + is_global_listener_active = true; + + cleanup: + return exit_status; +} + +static int sstore_central_global_stop_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if (ORTE_SUCCESS != (ret = orte_rml.recv_cancel(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_SSTORE_INTERNAL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + is_global_listener_active = false; + + cleanup: + return exit_status; +} + +static void sstore_central_global_recv(int status, + orte_process_name_t* sender, + opal_buffer_t* buffer, +
orte_rml_tag_t tag, + void* cbdata) +{ + if( ORTE_RML_TAG_SSTORE_INTERNAL != tag ) { + return; + } + + ORTE_MESSAGE_EVENT(sender, buffer, tag, sstore_central_global_process_cmd); + + return; +} + +static void sstore_central_global_process_cmd(int fd, + short event, + void *cbdata) +{ + int ret; + orte_message_event_t *mev = (orte_message_event_t*)cbdata; + orte_process_name_t *sender = NULL; + orte_sstore_central_cmd_flag_t command; + orte_std_cntr_t count; + orte_sstore_base_handle_t loc_id; + orte_sstore_central_global_snapshot_info_t *handle_info = NULL; + + sender = &(mev->sender); + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(global): process_cmd(%s)", + ORTE_NAME_PRINT(sender))); + + /* + * If this was an application process contacting us, then act like an orted + * instead of an HNP + */ + if(OPAL_EQUAL != orte_util_compare_name_fields(ORTE_NS_CMP_JOBID, + ORTE_PROC_MY_NAME, + sender)) { + orte_sstore_central_local_process_cmd(fd, event, cbdata); + return; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &command, &count, ORTE_SSTORE_CENTRAL_CMD))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &loc_id, &count, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + /* + * Find the referenced handle + */ + if(NULL == (handle_info = find_handle_info(loc_id)) ) { + goto cleanup; /* JJH big problem: unknown handle, cannot process the command */ + } + + /* + * Process the command + */ + if( ORTE_SSTORE_CENTRAL_PULL == command ) { + process_local_pull(sender, mev->buffer, handle_info); + } + else if( ORTE_SSTORE_CENTRAL_PUSH == command ) { + process_local_push(sender, mev->buffer, handle_info); + } + + cleanup: + return; +} + +static int process_local_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_global_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t loc_buffer; +
orte_sstore_central_cmd_flag_t command; + + /* + * Push back the requested information + */ + OBJ_CONSTRUCT(&loc_buffer, opal_buffer_t); + + command = ORTE_SSTORE_CENTRAL_PUSH; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &command, 1, ORTE_SSTORE_CENTRAL_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->seq_num), 1, OPAL_INT )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->ref_name), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->app_location_fmt), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(peer, &loc_buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&loc_buffer); + + return exit_status; +} + +static int process_local_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_global_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_std_cntr_t count; + size_t num_entries, i; + orte_process_name_t name; + bool ckpt_skipped = false; + char * crs_comp = NULL; + char * proc_name = NULL; + + /* + * Unpack the data + */ + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &num_entries, &count, OPAL_SIZE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + for(i = 0; i < num_entries; ++i ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &name, &count, ORTE_NAME))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + 
count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &ckpt_skipped, &count, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( !ckpt_skipped ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &crs_comp, &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Write this information to the global metadata + */ + orte_util_convert_process_name_to_string(&proc_name, &name); + + metadata_write_str(handle_info, + SSTORE_METADATA_INTERNAL_PROCESS_STR, + proc_name); + metadata_write_str(handle_info, + SSTORE_METADATA_LOCAL_CRS_COMP_STR, + crs_comp); + } + + if( NULL != crs_comp ) { + free(crs_comp); + crs_comp = NULL; + } + if( NULL != proc_name ) { + free(proc_name); + proc_name = NULL; + } + + (handle_info->num_procs_synced)++; + } + + cleanup: + if( NULL != crs_comp ) { + free(crs_comp); + crs_comp = NULL; + } + if( NULL != proc_name ) { + free(proc_name); + proc_name = NULL; + } + + return exit_status; +} + +static int init_global_snapshot_directory(orte_sstore_central_global_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + char * dir_name = NULL; + mode_t my_mode = S_IRWXU; + + /* + * Make the snapshot directory from the uniq_global_snapshot_name + */ + asprintf(&dir_name, "%s/%s", + handle_info->base_location, + handle_info->local_location); + if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(dir_name, my_mode)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Open up the metadata file + */ + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + if(NULL != dir_name) { + free(dir_name); + dir_name = NULL; + } + + return exit_status; +} + +/************************** + * Metadata functions + **************************/ +static int metadata_open(orte_sstore_central_global_snapshot_info_t * handle_info) +{ + /* If already open, 
then just return */ + if( NULL != handle_info->metadata ) { + return ORTE_SUCCESS; + } + + if (NULL == (handle_info->metadata = fopen(handle_info->metadata_filename, "a")) ) { + opal_output(orte_sstore_base_output, + "sstore:central:(global):metadata_open() Unable to open the file (%s)\n", + handle_info->metadata_filename); + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + return ORTE_SUCCESS; +} + +static int metadata_close(orte_sstore_central_global_snapshot_info_t * handle_info) +{ + /* If already closed, then just return */ + if( NULL == handle_info->metadata ) { + return ORTE_SUCCESS; + } + + fclose(handle_info->metadata); + handle_info->metadata = NULL; + + return ORTE_SUCCESS; +} + +static int metadata_write_int(orte_sstore_central_global_snapshot_info_t * handle_info, char *key, int value) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + fprintf(handle_info->metadata, "%s%d\n", key, value); + + cleanup: + return exit_status; +} + +static int metadata_write_str(orte_sstore_central_global_snapshot_info_t * handle_info, char *key, char *value) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + fprintf(handle_info->metadata, "%s%s\n", key, value); + + cleanup: + return exit_status; +} + +static int metadata_write_timestamp(orte_sstore_central_global_snapshot_info_t * handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + time_t timestamp; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup;
+ } + } + + timestamp = time(NULL); + fprintf(handle_info->metadata, "%s%s", + SSTORE_METADATA_INTERNAL_TIME_STR, + ctime(&timestamp)); + + cleanup: + return exit_status; +} + +static int orte_sstore_central_extract_global_metadata(orte_sstore_central_global_snapshot_info_t * handle_info, + orte_sstore_base_global_snapshot_info_t *global_snapshot) +{ + int exit_status = ORTE_SUCCESS; + orte_sstore_base_local_snapshot_info_t *vpid_snapshot = NULL; + opal_list_item_t* item = NULL; + int i = 0; + + /* + * Cleanup the structure a bit, so we can refresh it below + */ + while (NULL != (item = opal_list_remove_first(&global_snapshot->local_snapshots))) { + OBJ_RELEASE(item); + } + + if( NULL != global_snapshot->start_time ) { + free( global_snapshot->start_time ); + global_snapshot->start_time = NULL; + } + + if( NULL != global_snapshot->end_time ) { + free( global_snapshot->end_time ); + global_snapshot->end_time = NULL; + } + + /* + * Create a structure for each application process + */ + for(i = 0; i < handle_info->num_procs_total; ++i) { + vpid_snapshot = OBJ_NEW(orte_sstore_base_local_snapshot_info_t); + vpid_snapshot->ss_handle = handle_info->id; + + vpid_snapshot->process_name.jobid = handle_info->jobid; + vpid_snapshot->process_name.vpid = i; + + vpid_snapshot->crs_comp = NULL; + global_snapshot->start_time = NULL; + global_snapshot->end_time = NULL; + + opal_list_append(&global_snapshot->local_snapshots, &(vpid_snapshot->super)); + } + + return exit_status; +} + +static int central_snapshot_sort_compare_fn(opal_list_item_t **a, + opal_list_item_t **b) +{ + orte_sstore_base_local_snapshot_info_t *snap_a, *snap_b; + + snap_a = (orte_sstore_base_local_snapshot_info_t*)(*a); + snap_b = (orte_sstore_base_local_snapshot_info_t*)(*b); + + if( snap_a->process_name.vpid > snap_b->process_name.vpid ) { + return 1; + } + else if( snap_a->process_name.vpid == snap_b->process_name.vpid ) { + return 0; + } + else { + return -1; + } +} diff --git
a/orte/mca/sstore/central/sstore_central_local.c b/orte/mca/sstore/central/sstore_central_local.c new file mode 100644 index 0000000000..a977839a5b --- /dev/null +++ b/orte/mca/sstore/central/sstore_central_local.c @@ -0,0 +1,995 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + * + */ + +#include "orte_config.h" + +#ifdef HAVE_STRING_H +#include <string.h> +#endif +#include +#include +#include +#include +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif /* HAVE_UNISTD_H */ + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" + +#include "opal/event/event.h" + +#include "orte/constants.h" +#include "orte/util/show_help.h" +#include "opal/util/argv.h" +#include "opal/util/output.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" + +#include "opal/threads/mutex.h" +#include "opal/threads/condition.h" + +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/odls/odls_types.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "sstore_central.h" + +/********** + * Object stuff + **********/ +#define SSTORE_LOCAL_NONE 0 +#define SSTORE_LOCAL_ERROR 1 +#define SSTORE_LOCAL_INIT 2 +#define SSTORE_LOCAL_READY 3 +#define SSTORE_LOCAL_SYNCED 4 + +struct orte_sstore_central_local_snapshot_info_t { + /** List super object */ + opal_list_item_t super; + + /** Handle ID */ + orte_sstore_base_handle_t id; + + /** Status */ + int status; + + /** Sequence Number */ + int seq_num; + + /** Global Reference Name */ + char * global_ref_name; + + /** Local Location Format String */ + char * location_fmt; + + /** Application info handles */ + opal_list_t *app_info_handle; +}; +typedef struct
orte_sstore_central_local_snapshot_info_t orte_sstore_central_local_snapshot_info_t; +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_central_local_snapshot_info_t); + +void orte_sstore_central_local_snapshot_info_construct(orte_sstore_central_local_snapshot_info_t *info); +void orte_sstore_central_local_snapshot_info_destruct( orte_sstore_central_local_snapshot_info_t *info); + +OBJ_CLASS_INSTANCE(orte_sstore_central_local_snapshot_info_t, + opal_list_item_t, + orte_sstore_central_local_snapshot_info_construct, + orte_sstore_central_local_snapshot_info_destruct); + +struct orte_sstore_central_local_app_snapshot_info_t { + /** List super object */ + opal_list_item_t super; + + /** Process Name associated with this entry */ + orte_process_name_t name; + + /** Local Location (Absolute Path) */ + char * local_location; + + /** Metadata File Name (Absolute Path) */ + char * metadata_filename; + + /** CRS Component used */ + char * crs_comp; + + /** If this app. skipped the checkpoint - usually for non-migrating procs */ + bool ckpt_skipped; +}; +typedef struct orte_sstore_central_local_app_snapshot_info_t orte_sstore_central_local_app_snapshot_info_t; +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_central_local_app_snapshot_info_t); + +void orte_sstore_central_local_app_snapshot_info_construct(orte_sstore_central_local_app_snapshot_info_t *info); +void orte_sstore_central_local_app_snapshot_info_destruct( orte_sstore_central_local_app_snapshot_info_t *info); + +OBJ_CLASS_INSTANCE(orte_sstore_central_local_app_snapshot_info_t, + opal_list_item_t, + orte_sstore_central_local_app_snapshot_info_construct, + orte_sstore_central_local_app_snapshot_info_destruct); + + + +/********** + * Local Function and Variable Declarations + **********/ +static bool is_global_listener_active = false; +static int sstore_central_local_start_listener(void); +static int sstore_central_local_stop_listener(void); +static void sstore_central_local_recv(int status, + orte_process_name_t* 
sender, + opal_buffer_t* buffer, + orte_rml_tag_t tag, + void* cbdata); + +static int process_global_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_local_snapshot_info_t *handle_info); +static int process_global_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_local_snapshot_info_t *handle_info); +static int process_app_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_local_snapshot_info_t *handle_info); +static int process_app_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_local_snapshot_info_t *handle_info); + +static orte_sstore_central_local_snapshot_info_t *create_new_handle_info(orte_sstore_base_handle_t handle); +static orte_sstore_central_local_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle); + +static int append_new_app_handle_info(orte_sstore_central_local_snapshot_info_t *handle_info, + orte_process_name_t *name); +static orte_sstore_central_local_app_snapshot_info_t *find_app_handle_info(orte_sstore_central_local_snapshot_info_t *handle_info, + orte_process_name_t *name); + +static int pull_handle_info(orte_sstore_central_local_snapshot_info_t *handle_info ); +static int push_handle_info(orte_sstore_central_local_snapshot_info_t *handle_info ); + +static int wait_all_apps_updated(orte_sstore_central_local_snapshot_info_t *handle_info); + + +static opal_list_t *active_handles = NULL; + +/********** + * Object stuff + **********/ +void orte_sstore_central_local_snapshot_info_construct(orte_sstore_central_local_snapshot_info_t *info) +{ + info->id = 0; + + info->status = SSTORE_LOCAL_NONE; + + info->seq_num = -1; + + info->global_ref_name = NULL; + + info->location_fmt = NULL; + + info->app_info_handle = OBJ_NEW(opal_list_t); +} + +void orte_sstore_central_local_snapshot_info_destruct( orte_sstore_central_local_snapshot_info_t *info) +{ + info->id = 0; + + info->status = SSTORE_LOCAL_NONE; + + info->seq_num = -1; + + if( NULL != 
info->global_ref_name ) { + free( info->global_ref_name ); + info->global_ref_name = NULL; + } + + if( NULL != info->location_fmt ) { + free( info->location_fmt ); + info->location_fmt = NULL; + } + + if( NULL != info->app_info_handle ) { + OBJ_RELEASE(info->app_info_handle); + info->app_info_handle = NULL; + } +} + +void orte_sstore_central_local_app_snapshot_info_construct(orte_sstore_central_local_app_snapshot_info_t *info) +{ + info->name.jobid = ORTE_JOBID_INVALID; + info->name.vpid = ORTE_VPID_INVALID; + + info->local_location = NULL; + info->metadata_filename = NULL; + info->crs_comp = NULL; + info->ckpt_skipped = false; +} + +void orte_sstore_central_local_app_snapshot_info_destruct( orte_sstore_central_local_app_snapshot_info_t *info) +{ + info->name.jobid = ORTE_JOBID_INVALID; + info->name.vpid = ORTE_VPID_INVALID; + + if( NULL != info->local_location ) { + free(info->local_location); + info->local_location = NULL; + } + + if( NULL != info->metadata_filename ) { + free(info->metadata_filename); + info->metadata_filename = NULL; + } + + if( NULL != info->crs_comp ) { + free(info->crs_comp); + info->crs_comp = NULL; + } + + info->ckpt_skipped = false; +} + +/****************** + * Local functions + ******************/ +int orte_sstore_central_local_module_init(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): init()")); + + if( NULL == active_handles ) { + active_handles = OBJ_NEW(opal_list_t); + } + + /* + * Setup a listener for the HNP/Apps + * We could be the HNP, in which case the listener is already registered. 
+ */ + if( !ORTE_PROC_IS_HNP ) { + if( ORTE_SUCCESS != (ret = sstore_central_local_start_listener()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + cleanup: + return exit_status; +} + +int orte_sstore_central_local_module_finalize(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): finalize()")); + + if( NULL != active_handles ) { + OBJ_RELEASE(active_handles); + } + + /* + * Shutdown the listener for the HNP/Apps + * We could be the HNP, in which case the listener is already deregistered. + */ + if( !ORTE_PROC_IS_HNP ) { + if( ORTE_SUCCESS != (ret = sstore_central_local_stop_listener()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + cleanup: + return exit_status; +} + +int orte_sstore_central_local_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid) +{ + opal_output(0, "sstore:central:(local): request_checkpoint_handle() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_central_local_register(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_local_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): register()")); + + /* + * Create a handle + */ + if( NULL == (handle_info = find_handle_info(handle)) ) { + handle_info = create_new_handle_info(handle); + } + + /* + * Get basic information from Global SStore + */ + if( ORTE_SUCCESS != (ret = pull_handle_info(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Wait here until the pull request has been satisfied + */ + while(SSTORE_LOCAL_READY != handle_info->status && + SSTORE_LOCAL_ERROR != handle_info->status ) { + opal_progress(); + } + + cleanup: + return exit_status; +} + +int 
orte_sstore_central_local_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value) +{ + opal_output(0, "sstore:central:(local): get_attr() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_central_local_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value) +{ + opal_output(0, "sstore:central:(local): set_attr() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_central_local_sync(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_local_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): sync()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Wait for all of the applications to update their metadata + */ + if( ORTE_SUCCESS != (ret = wait_all_apps_updated(handle_info))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Push information to the Global coordinator + */ + if( ORTE_SUCCESS != (ret = push_handle_info(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + handle_info->status = SSTORE_LOCAL_SYNCED; + + cleanup: + return exit_status; +} + +int orte_sstore_central_local_remove(orte_sstore_base_handle_t handle) +{ + opal_output(0, "sstore:central:(local): remove() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_central_local_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): pack()")); + + /* + * Lookup the handle + */ + + + /* + * Pack the handle ID + */ + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &handle, 1, ORTE_SSTORE_HANDLE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + 
/* + * Pack any metadata + */ + + cleanup: + return exit_status; +} + +int orte_sstore_central_local_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_central_local_snapshot_info_t *handle_info = NULL; + orte_std_cntr_t count; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): unpack()")); + + /* + * Unpack the handle id + */ + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, handle, &count, ORTE_SSTORE_HANDLE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Lookup the handle + */ + if( NULL == (handle_info = find_handle_info(*handle)) ) { + handle_info = create_new_handle_info(*handle); + } + + /* + * Unpack the metadata piggybacked on this message + */ + if( ORTE_SUCCESS != (ret = process_global_push(peer, buffer, handle_info))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): unpack(%d, %d, %s)", + handle_info->id, + handle_info->seq_num, + handle_info->global_ref_name)); + + cleanup: + return exit_status; +} + +/************************** + * Local functions + **************************/ +static orte_sstore_central_local_snapshot_info_t *create_new_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_central_local_snapshot_info_t *handle_info = NULL; + opal_list_item_t *item = NULL; + orte_odls_child_t *child = NULL; + + if( NULL == active_handles ) { + active_handles = OBJ_NEW(opal_list_t); + } + + handle_info = OBJ_NEW(orte_sstore_central_local_snapshot_info_t); + + handle_info->id = handle; + + opal_list_append(active_handles, &(handle_info->super)); + + /* + * Create a sub structure for each child + */ + for (item = opal_list_get_first(&orte_local_children); + item != opal_list_get_end(&orte_local_children); + item = 
opal_list_get_next(item)) { + child = (orte_odls_child_t*)item; + append_new_app_handle_info(handle_info, child->name); + } + + handle_info->status = SSTORE_LOCAL_INIT; + + return handle_info; +} + +static orte_sstore_central_local_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_central_local_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + + if( NULL == active_handles ) { + return NULL; + } + + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_central_local_snapshot_info_t*)item; + + if( handle_info->id == handle ) { + return handle_info; + } + } + + return NULL; +} + +static int append_new_app_handle_info(orte_sstore_central_local_snapshot_info_t *handle_info, + orte_process_name_t *name) +{ + orte_sstore_central_local_app_snapshot_info_t *app_info = NULL; + + app_info = OBJ_NEW(orte_sstore_central_local_app_snapshot_info_t); + + app_info->name.jobid = name->jobid; + app_info->name.vpid = name->vpid; + + opal_list_append(handle_info->app_info_handle, &(app_info->super)); + + return ORTE_SUCCESS; +} + +static orte_sstore_central_local_app_snapshot_info_t *find_app_handle_info(orte_sstore_central_local_snapshot_info_t *handle_info, + orte_process_name_t *name) +{ + orte_sstore_central_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t* item = NULL; + + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_central_local_app_snapshot_info_t*)item; + + if( app_info->name.jobid == name->jobid && + app_info->name.vpid == name->vpid ) { + return app_info; + } + } + + return NULL; +} + +static int sstore_central_local_start_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if( is_global_listener_active ) { + return ORTE_SUCCESS; + } + + if (ORTE_SUCCESS 
!= (ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_SSTORE_INTERNAL, + ORTE_RML_PERSISTENT, + sstore_central_local_recv, + NULL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + is_global_listener_active = true; + + cleanup: + return exit_status; +} + +static int sstore_central_local_stop_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if (ORTE_SUCCESS != (ret = orte_rml.recv_cancel(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_SSTORE_INTERNAL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + is_global_listener_active = false; + + cleanup: + return exit_status; +} + +static void sstore_central_local_recv(int status, + orte_process_name_t* sender, + opal_buffer_t* buffer, + orte_rml_tag_t tag, + void* cbdata) +{ + if( ORTE_RML_TAG_SSTORE_INTERNAL != tag ) { + return; + } + + ORTE_MESSAGE_EVENT(sender, buffer, tag, orte_sstore_central_local_process_cmd); + + return; +} + +void orte_sstore_central_local_process_cmd(int fd, + short event, + void *cbdata) +{ + int ret; + orte_message_event_t *mev = (orte_message_event_t*)cbdata; + orte_process_name_t *sender = NULL; + orte_sstore_central_cmd_flag_t command; + orte_std_cntr_t count; + orte_sstore_base_handle_t loc_id; + orte_sstore_central_local_snapshot_info_t *handle_info = NULL; + + sender = &(mev->sender); + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): process_cmd(%s)", + ORTE_NAME_PRINT(sender))); + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &command, &count, ORTE_SSTORE_CENTRAL_CMD))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &loc_id, &count, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + /* + * Find the referenced handle (Create if it does not exist) + */ + if(NULL == (handle_info = find_handle_info(loc_id)) ) { + handle_info = create_new_handle_info(loc_id); + } + 
+ /* + * Process the command + */ + if( ORTE_SSTORE_CENTRAL_PULL == command ) { + if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_PROC_MY_HNP, sender)) { + process_global_pull(sender, mev->buffer, handle_info); + } else { + process_app_pull(sender, mev->buffer, handle_info); + } + } + else if( ORTE_SSTORE_CENTRAL_PUSH == command ) { + if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_PROC_MY_HNP, sender)) { + process_global_push(sender, mev->buffer, handle_info); + } else { + process_app_push(sender, mev->buffer, handle_info); + } + } + + cleanup: + return; +} + +static int process_global_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_local_snapshot_info_t *handle_info) +{ + /* JJH should be as simple as calling push_handle_info() */ + opal_output(0, "sstore:central:(local): process_global_pull() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +static int process_global_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_local_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_std_cntr_t count; + orte_sstore_central_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t* item = NULL; + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(handle_info->seq_num), &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(handle_info->global_ref_name), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(handle_info->location_fmt), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * For each process we are working with + */ + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = 
opal_list_get_next(item) ) { + app_info = (orte_sstore_central_local_app_snapshot_info_t*)item; + + if( NULL != app_info->local_location ) { + free(app_info->local_location); + app_info->local_location = NULL; + } + asprintf(&(app_info->local_location), handle_info->location_fmt, app_info->name.vpid); + + if( NULL != app_info->metadata_filename ) { + free(app_info->metadata_filename); + app_info->metadata_filename = NULL; + } + asprintf(&(app_info->metadata_filename), "%s/%s", + app_info->local_location, + orte_sstore_base_local_metadata_filename); + } + + cleanup: + if( ORTE_SUCCESS == exit_status ) { + handle_info->status = SSTORE_LOCAL_READY; + } else { + handle_info->status = SSTORE_LOCAL_ERROR; + } + + return exit_status; +} + +static int process_app_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_local_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t loc_buffer; + orte_sstore_central_cmd_flag_t command; + orte_sstore_central_local_app_snapshot_info_t *app_info = NULL; + + /* + * Find this app's data + */ + app_info = find_app_handle_info(handle_info, peer); + + /* + * Push back the requested information + */ + OBJ_CONSTRUCT(&loc_buffer, opal_buffer_t); + + command = ORTE_SSTORE_CENTRAL_PUSH; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &command, 1, ORTE_SSTORE_CENTRAL_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->seq_num), 1, OPAL_INT )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->global_ref_name), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = 
opal_dss.pack(&loc_buffer, &(app_info->local_location), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(app_info->metadata_filename), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(peer, &loc_buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&loc_buffer); + + return exit_status; +} + +static int process_app_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_central_local_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_std_cntr_t count; + orte_sstore_central_local_app_snapshot_info_t *app_info = NULL; + + /* + * Find this app's data + */ + app_info = find_app_handle_info(handle_info, peer); + + /* + * Unpack the data + */ + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(app_info->ckpt_skipped), &count, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( !app_info->ckpt_skipped ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(app_info->crs_comp), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): app_push(%s, skip=%s, %s)", + ORTE_NAME_PRINT(&(app_info->name)), + (app_info->ckpt_skipped ? 
"T" : "F"), + app_info->crs_comp)); + + cleanup: + return exit_status; +} + +static int wait_all_apps_updated(orte_sstore_central_local_snapshot_info_t *handle_info) +{ + orte_sstore_central_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t *item = NULL; + bool is_done = true; + + do { + is_done = true; + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_central_local_app_snapshot_info_t*)item; + + if( NULL == app_info->crs_comp && !app_info->ckpt_skipped ) { + is_done = false; + break; + } + } + + if( !is_done ) { + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central:(local): Waiting for application %s", + ORTE_NAME_PRINT(&(app_info->name)) )); + opal_progress(); + } + } while(!is_done); + + return ORTE_SUCCESS; +} + +static int pull_handle_info(orte_sstore_central_local_snapshot_info_t *handle_info ) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t buffer; + orte_sstore_central_cmd_flag_t command; + + /* + * Check to see if this is necessary + * (Did we get all of the info from the handle unpack?)
+ */ + if( 0 <= handle_info->seq_num && + NULL != handle_info->global_ref_name && + NULL != handle_info->location_fmt ) { + handle_info->status = SSTORE_LOCAL_READY; + return ORTE_SUCCESS; + } + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + /* + * Ask the daemon to send us the info that we need + */ + command = ORTE_SSTORE_CENTRAL_PULL; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SSTORE_CENTRAL_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} + +static int push_handle_info(orte_sstore_central_local_snapshot_info_t *handle_info ) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t buffer; + orte_sstore_central_cmd_flag_t command; + orte_sstore_central_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t *item = NULL; + size_t list_size; + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + command = ORTE_SSTORE_CENTRAL_PUSH; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SSTORE_CENTRAL_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + list_size = opal_list_get_size(handle_info->app_info_handle); + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &list_size, 1, OPAL_SIZE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * For each process we are working with + */ + for(item = opal_list_get_first(handle_info->app_info_handle); + item != 
opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_central_local_app_snapshot_info_t*)item; + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(app_info->name), 1, ORTE_NAME )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(app_info->ckpt_skipped), 1, OPAL_BOOL )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( !app_info->ckpt_skipped ) { + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(app_info->crs_comp), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} diff --git a/orte/mca/sstore/central/sstore_central_module.c b/orte/mca/sstore/central/sstore_central_module.c new file mode 100644 index 0000000000..15b7d3a831 --- /dev/null +++ b/orte/mca/sstore/central/sstore_central_module.c @@ -0,0 +1,359 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + * + */ + +#include "orte_config.h" + +#ifdef HAVE_STRING_H +#include <string.h> +#endif +#include +#include +#include +#include +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif /* HAVE_UNISTD_H */ + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" + +#include "opal/event/event.h" + +#include "orte/constants.h" +#include "orte/util/show_help.h" +#include "opal/util/argv.h" +#include "opal/util/output.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" + +#include "opal/threads/mutex.h" +#include "opal/threads/condition.h" + +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/snapc/snapc.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "sstore_central.h" + +/********** + * Local Function and Variable Declarations + **********/ + +/* + * central module + */ +static orte_sstore_base_module_t loc_module = { + /** Initialization Function */ + orte_sstore_central_module_init, + /** Finalization Function */ + orte_sstore_central_module_finalize, + + orte_sstore_central_request_checkpoint_handle, + orte_sstore_central_request_restart_handle, + orte_sstore_central_request_global_snapshot_data, + orte_sstore_central_register, + orte_sstore_central_get_attr, + orte_sstore_central_set_attr, + orte_sstore_central_sync, + orte_sstore_central_remove, + + orte_sstore_central_pack, + orte_sstore_central_unpack, + orte_sstore_central_fetch_app_deps, + orte_sstore_central_wait_all_deps +}; + +/* + * MCA Functions + */ +int orte_sstore_central_component_query(mca_base_module_t **module, int *priority) +{ + OPAL_OUTPUT_VERBOSE((10, 
mca_sstore_central_component.super.output_handle, + "sstore:central: component_query()"));
+ + *priority = mca_sstore_central_component.super.priority; + *module = (mca_base_module_t *)&loc_module; + + return ORTE_SUCCESS; +} + +int orte_sstore_central_module_init(void) +{ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central: module_init()")); + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + /* Global is also a Local, so it will init the local module internally */ + return orte_sstore_central_global_module_init(); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_central_local_module_init(); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_central_app_module_init(); + } + + return ORTE_SUCCESS; +} + +int orte_sstore_central_module_finalize(void) +{ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_central_component.super.output_handle, + "sstore:central: module_finalize()")); + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + /* Global is also a Local, so it will fin. 
the local module internally */ + return orte_sstore_central_global_module_finalize(); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_central_local_module_finalize(); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_central_app_module_finalize(); + } + + return ORTE_SUCCESS; +} + +/****************** + * Local functions + ******************/ +int orte_sstore_central_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid) +{ + if( orte_sstore_context & ORTE_SSTORE_TOOL_TYPE ) { + opal_output(0, "sstore:central:(tool): request_checkpoint_handle() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_request_checkpoint_handle(handle, seq, jobid); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_central_local_request_checkpoint_handle(handle, seq, jobid); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_central_app_request_checkpoint_handle(handle, seq, jobid); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_central_request_restart_handle(orte_sstore_base_handle_t *handle, char *basedir, char *ref, int seq, + orte_sstore_base_global_snapshot_info_t *snapshot) +{ + if( orte_sstore_context & ORTE_SSTORE_TOOL_TYPE ) { + return orte_sstore_base_tool_request_restart_handle(handle, basedir, ref, seq, snapshot); + } + else if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_request_restart_handle(handle, basedir, ref, seq, snapshot); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + opal_output(0, "sstore:central:(local): request_restart_handle() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + opal_output(0, "sstore:central:(app): request_restart_handle() Not supported!"); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int 
orte_sstore_central_request_global_snapshot_data(orte_sstore_base_handle_t *handle, + orte_sstore_base_global_snapshot_info_t *snapshot) +{ + if( orte_sstore_context & ORTE_SSTORE_TOOL_TYPE ) { + opal_output(0, "sstore:central:(tool): request_global_snapshot_data() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_request_global_snapshot_data(handle, snapshot); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + opal_output(0, "sstore:central:(local): request_global_snapshot_data() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + opal_output(0, "sstore:central:(app): request_global_snapshot_data() Not supported!"); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_central_register(orte_sstore_base_handle_t handle) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_register(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_central_local_register(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_central_app_register(handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_central_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_TOOL_TYPE ) { + return orte_sstore_base_tool_get_attr(handle, key, value); + } + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_get_attr(handle, key, value); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_central_local_get_attr(handle, key, value); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return 
orte_sstore_central_app_get_attr(handle, key, value); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_central_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + opal_output(0, "Error: (%6s) Passed an invalid handle (%d) [%d = \"%s\"]", + (orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ? "global" : + (orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ? "local" : + (orte_sstore_context & ORTE_SSTORE_APP_TYPE ? "app" : "other"))), + (int)handle, key, value); + { + int sleeper = 2; + while(sleeper == 1 ) { + sleep(1); + } + } + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_set_attr(handle, key, value); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_central_local_set_attr(handle, key, value); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_central_app_set_attr(handle, key, value); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_central_sync(orte_sstore_base_handle_t handle) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_sync(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_central_local_sync(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_central_app_sync(handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_central_remove(orte_sstore_base_handle_t handle) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_remove(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return 
orte_sstore_central_local_remove(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_central_app_remove(handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_central_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_pack(peer, buffer, handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_central_local_pack(peer, buffer, handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_central_app_pack(peer, buffer, handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_central_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle) +{ + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_central_global_unpack(peer, buffer, handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_central_local_unpack(peer, buffer, handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_central_app_unpack(peer, buffer, handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_central_fetch_app_deps(orte_app_context_t *app) +{ + /* Nothing to do */ + return ORTE_SUCCESS; +} + +int orte_sstore_central_wait_all_deps(void) +{ + /* Nothing to do */ + return ORTE_SUCCESS; +} + +/************************** + * Local functions + **************************/ diff --git a/orte/mca/sstore/orte_sstore.7in b/orte/mca/sstore/orte_sstore.7in new file mode 100644 index 0000000000..155ca7964d --- /dev/null +++ b/orte/mca/sstore/orte_sstore.7in @@ -0,0 +1,66 @@ +.\" +.\" Copyright (c) 2010 The Trustees of Indiana University and Indiana +.\" University Research and Technology +.\" Corporation. 
All rights reserved. +.\" +.\" Man page for ORTE's SStore Functionality +.\" +.\" .TH name section center-footer left-footer center-header +.TH ORTE_SSTORE 7 "#OMPI_DATE#" "#PACKAGE_VERSION#" "#PACKAGE_NAME#" +.\" ************************** +.\" Name Section +.\" ************************** +.SH NAME +. +Open RTE MCA Stable Storage (SStore) Framework \- Overview of Open RTE's SStore +framework, and selected modules. #PACKAGE_NAME# #PACKAGE_VERSION# +. +.\" ************************** +.\" Description Section +.\" ************************** +.SH DESCRIPTION +. +.PP +SStore is the stable storage framework used by OpenRTE to store checkpoint files and their +metadata, either writing them directly to central storage or staging them there from each node. +. +.\" ********************************** +.\" Available Components Section +.\" ********************************** +.SH AVAILABLE COMPONENTS +.PP +The following MCA parameters apply to all components: +. +.TP 4 +sstore_base_verbose +Set the verbosity level for all components. Default is 0, or silent except on +error. +. +. +.\" central Component +.\" ****************** +.SS central SStore Component +.PP +The \fIcentral\fR component implements a fully centralized stable storage +mechanism that requires a shared storage medium (e.g., NFS). +. +.PP +The \fIcentral\fR component has the following MCA parameters: +. +.TP 4 +sstore_central_priority +The component's priority to use when selecting the most appropriate component +for a run. +. +.TP 4 +sstore_central_verbose +Set the verbosity level for this component. Default is 0, or silent except on +error. +. +.\" ************************** +.\" See Also Section +.\" ************************** +. +.SH SEE ALSO + orte-checkpoint(1), orte-restart(1), opal-checkpoint(1), opal-restart(1), orte_snapc(7), opal_crs(7) +.
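Every entry point of the `central` module in the hunk above follows the same dispatch pattern. It first rejects `ORTE_SSTORE_HANDLE_INVALID`, then routes the call on the role bits in `orte_sstore_context`. The sketch below is a minimal, self-contained illustration of that pattern and is not code from this patch. The `SKETCH_*` constants, the `sketch_role` enum and the three handler functions are hypothetical stand-ins for the `ORTE_SSTORE_*_TYPE` bits, the ORTE return codes and the `orte_sstore_central_{global,local,app}_*` implementations. The sketch also omits the `ORTE_ERROR_LOG` call.

```c
#include <stdint.h>

/* Hypothetical role bits standing in for ORTE_SSTORE_TOOL_TYPE,
 * ORTE_SSTORE_GLOBAL_TYPE, ORTE_SSTORE_LOCAL_TYPE and ORTE_SSTORE_APP_TYPE;
 * the real values are defined elsewhere in the SStore framework. */
#define SKETCH_TOOL_TYPE   0x1u
#define SKETCH_GLOBAL_TYPE 0x2u
#define SKETCH_LOCAL_TYPE  0x4u
#define SKETCH_APP_TYPE    0x8u

/* Hypothetical stand-ins for ORTE_SUCCESS, ORTE_ERROR,
 * ORTE_ERR_NOT_SUPPORTED and ORTE_SSTORE_HANDLE_INVALID. */
#define SKETCH_SUCCESS             0
#define SKETCH_ERROR              -1
#define SKETCH_ERR_NOT_SUPPORTED  -2
#define SKETCH_HANDLE_INVALID      0u

/* Records which role-specific handler ran last, so the routing is observable. */
enum sketch_role { ROLE_NONE, ROLE_GLOBAL, ROLE_LOCAL, ROLE_APP };
static enum sketch_role sketch_last_role = ROLE_NONE;

/* Hypothetical per-role handlers (the patch calls
 * orte_sstore_central_{global,local,app}_sync() at this point). */
static int global_sync(uint32_t handle) { (void)handle; sketch_last_role = ROLE_GLOBAL; return SKETCH_SUCCESS; }
static int local_sync(uint32_t handle)  { (void)handle; sketch_last_role = ROLE_LOCAL;  return SKETCH_SUCCESS; }
static int app_sync(uint32_t handle)    { (void)handle; sketch_last_role = ROLE_APP;    return SKETCH_SUCCESS; }

/* Same shape as orte_sstore_central_sync(): reject the invalid handle,
 * then test the role bits in a fixed order. Because this is an
 * if/else-if chain, GLOBAL wins over LOCAL, and LOCAL over APP,
 * when more than one bit is set. */
int sketch_sync(uint32_t context, uint32_t handle)
{
    if (SKETCH_HANDLE_INVALID == handle) {
        return SKETCH_ERROR;
    }

    if (context & SKETCH_GLOBAL_TYPE) {
        return global_sync(handle);
    } else if (context & SKETCH_LOCAL_TYPE) {
        return local_sync(handle);
    } else if (context & SKETCH_APP_TYPE) {
        return app_sync(handle);
    }

    /* No handler for this role (e.g. a tool process). */
    return SKETCH_ERR_NOT_SUPPORTED;
}
```

Because of the chain's ordering, a context with several role bits set is served by the most "global" role. Whether that combination can actually occur depends on how `orte_sstore_context` is initialized, which is outside this hunk.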
diff --git a/orte/mca/sstore/sstore.h b/orte/mca/sstore/sstore.h new file mode 100644 index 0000000000..d1ead7e939 --- /dev/null +++ b/orte/mca/sstore/sstore.h @@ -0,0 +1,404 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ +/** + * @file + * + * Distributed Stable Storage (SStore) Interface + * + */ + +#ifndef MCA_SSTORE_H +#define MCA_SSTORE_H + +#include "orte_config.h" +#include "orte/constants.h" +#include "orte/types.h" +#include "orte/runtime/orte_globals.h" + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" + +#include "opal/class/opal_object.h" + +BEGIN_C_DECLS + +/** + * Keys accepted as metadata + */ +typedef uint32_t orte_sstore_base_key_t; +/** CRS Component */ +#define SSTORE_METADATA_LOCAL_CRS_COMP 0 +/** Compress Component */ +#define SSTORE_METADATA_LOCAL_COMPRESS_COMP 1 +/** Compress Component Postfix */ +#define SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX 2 +/** Process PID */ +#define SSTORE_METADATA_LOCAL_PID 3 +/** Checkpoint Context File */ +#define SSTORE_METADATA_LOCAL_CONTEXT 4 +/** Directory to make on restart */ +#define SSTORE_METADATA_LOCAL_MKDIR 5 +/** File to touch on restart */ +#define SSTORE_METADATA_LOCAL_TOUCH 6 + +/** Local snapshot reference (e.g., opal_snapshot_0.ckpt) */ +#define SSTORE_METADATA_LOCAL_SNAP_REF 7 +/** Local snapshot reference format string (e.g., opal_snapshot_%d.ckpt) passed vpid */ +#define SSTORE_METADATA_LOCAL_SNAP_REF_FMT 8 +/** Local snapshot directory (Full Path excluding reference) */ +#define SSTORE_METADATA_LOCAL_SNAP_LOC 9 +/** Local snapshot reference directory (Full Path) */ +#define SSTORE_METADATA_LOCAL_SNAP_REF_LOC_FMT 10 +/** Local snapshot metadata file (Full Path) */ +#define SSTORE_METADATA_LOCAL_SNAP_META 11 + +/** Global snapshot reference (e.g., ompi_global_snapshot_1234.ckpt) */ +#define 
SSTORE_METADATA_GLOBAL_SNAP_REF 12 +/** Global snapshot location (Relative Path from base) */ +#define SSTORE_METADATA_GLOBAL_SNAP_LOC 13 +/** Global snapshot location (Full path) */ +#define SSTORE_METADATA_GLOBAL_SNAP_LOC_ABS 14 +/** Global snapshot metadata file (Full path) */ +#define SSTORE_METADATA_GLOBAL_SNAP_META 15 +/** Global snapshot sequence number */ +#define SSTORE_METADATA_GLOBAL_SNAP_SEQ 16 +/** AMCA Parameter to be preserved for ompi-restart */ +#define SSTORE_METADATA_GLOBAL_AMCA_PARAM 17 + +/** Total number of sequence numbers for this snapshot */ +#define SSTORE_METADATA_GLOBAL_SNAP_NUM_SEQ 18 +/** Comma separated list of all sequence numbers for this snapshot */ +#define SSTORE_METADATA_GLOBAL_SNAP_ALL_SEQ 19 + +/** Access the current default base directory (Full Path) */ +#define SSTORE_METADATA_BASE_LOC 20 + +/** The local process is skipping the checkpoint + * Usually this is because there is a migration, and it is not participating + */ +#define SSTORE_METADATA_LOCAL_SKIP_CKPT 21 + +/** A Migration checkpoint does not necessarily contain all of the processes + * in the job, so it is not a checkpoint that can be restarted from normally. + * Therefore, it needs to be marked specially. 
*/ +#define SSTORE_METADATA_GLOBAL_MIGRATING 22 + +/** */ +#define SSTORE_METADATA_MAX 23 + +/** + * Storage handle + */ +#define ORTE_SSTORE_HANDLE OPAL_UINT32 +typedef uint32_t orte_sstore_base_handle_t; +ORTE_DECLSPEC extern orte_sstore_base_handle_t orte_sstore_handle_current; +ORTE_DECLSPEC extern orte_sstore_base_handle_t orte_sstore_handle_last_stable; +#define ORTE_SSTORE_HANDLE_INVALID 0 + +/** + * Local and Global snapshot information structure + * Primarily used by orte-restart as an abstract way to handle metadata + */ +struct orte_sstore_base_local_snapshot_info_1_0_0_t { + /** List super object */ + opal_list_item_t super; + + /** Stable Storage Handle */ + orte_sstore_base_handle_t ss_handle; + + /** ORTE Process name */ + orte_process_name_t process_name; + + /** CRS Component */ + char *crs_comp; + + /** Compress Component */ + char *compress_comp; + + /** Compress Component Postfix */ + char *compress_postfix; + + /** Start/End Timestamps */ + char *start_time; + char *end_time; +}; +typedef struct orte_sstore_base_local_snapshot_info_1_0_0_t orte_sstore_base_local_snapshot_info_1_0_0_t; +typedef struct orte_sstore_base_local_snapshot_info_1_0_0_t orte_sstore_base_local_snapshot_info_t; + +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_base_local_snapshot_info_t); + +struct orte_sstore_base_global_snapshot_info_1_0_0_t { + /** List super object */ + opal_list_item_t super; + + /** A list of orte_sstore_base_local_snapshot_info_t's */ + opal_list_t local_snapshots; + + /** Stable Storage Handle */ + orte_sstore_base_handle_t ss_handle; + + /** Start Timestamp */ + char * start_time; + + /** End Timestamp */ + char * end_time; + + /** Sequence number */ + int seq_num; + + /** Reference */ + char *reference; + + /** AMCA parameter used */ + char *amca_param; + + /** Internal use only: Cache some information on the structure */ + int num_seqs; + char ** all_seqs; + char *basedir; + char *metadata_filename; +}; +typedef struct 
orte_sstore_base_global_snapshot_info_1_0_0_t orte_sstore_base_global_snapshot_info_1_0_0_t; +typedef struct orte_sstore_base_global_snapshot_info_1_0_0_t orte_sstore_base_global_snapshot_info_t; + +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_base_global_snapshot_info_t); + +/** + * Module initialization function. + * Returns ORTE_SUCCESS + */ +typedef int (*orte_sstore_base_module_init_fn_t) + (void); + +/** + * Module finalization function. + * Returns ORTE_SUCCESS + */ +typedef int (*orte_sstore_base_module_finalize_fn_t) + (void); + +/** + * Request a checkpoint storage handle from stable storage + * + * @param handle Checkpoint storage handle + * @param seq Sequence number of the checkpoint + * @param jobid Job ID of the job being checkpointed + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_request_checkpoint_handle_fn_t) + (orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid); + +/** + * Request a restart storage handle from stable storage + * This function will fail if the key cannot be matched. + * If multiple matches exist, it will return the latest one. + * If the key is NULL, then the latest entry will be used. + * + * @param handle Restart storage handle + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_request_restart_handle_fn_t) + (orte_sstore_base_handle_t *handle, + char *basedir, char *ref, int seq, + orte_sstore_base_global_snapshot_info_t *snapshot); + +/** + * Request snapshot info from a given handle. + * If the key is NULL, then the latest entry will be used. + * + * @param handle Restart storage handle + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_request_global_snapshot_data_fn_t) + (orte_sstore_base_handle_t *handle, + orte_sstore_base_global_snapshot_info_t *snapshot); + +/** + * Register access to a handle.
+ * + * @param handle Storage handle + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_register_handle_fn_t) + (orte_sstore_base_handle_t handle); + +/** + * Get attribute on the storage handle + * + * @param handle Storage handle + * @param key Key to access + * @param value Value of the key. NULL if not available + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_get_attribute_fn_t) + (orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value); + +/** + * Set attribute on the storage handle + * + * @param handle Storage handle + * @param key Key to set + * @param value Value of the key. + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_set_attribute_fn_t) + (orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value); + +/** + * Synchronize the handle + * + * @param handle Storage handle + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_sync_fn_t) + (orte_sstore_base_handle_t handle); + +/** + * Remove data associated with the handle + * + * @param handle Storage handle + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_remove_fn_t) + (orte_sstore_base_handle_t handle); + +/** + * Pack a handle into a buffer + * Only called between the HNP and ORTED (or Global and Local SnapC coordinators) + * + * @param peer Peer to which this is being sent (or NULL if to all peers) + * @param buffer Buffer to pack the data into + * @param handle Storage handle + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_pack_fn_t) + (orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle); + +/** + * Unpack a handle from a buffer + * Only called between the HNP and ORTED
(or Global and Local SnapC coordinators) + * + * @param peer Peer from which this was received + * @param buffer Buffer to unpack the data + * @param handle Storage handle + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_unpack_fn_t) + (orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle); + +/** + * Fetch application context dependencies before local launch + * + * @param app Application context + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_fetch_app_deps_fn_t) + (orte_app_context_t *app); + +/** + * Wait for all application context dependencies to be fetched + * + * @return ORTE_SUCCESS on success + * @return ORTE_ERROR on failure + */ +typedef int (*orte_sstore_base_wait_all_deps_fn_t) + (void); + +/** + * Structure for SSTORE components. + */ +struct orte_sstore_base_component_2_0_0_t { + /** MCA base component */ + mca_base_component_t base_version; + /** MCA base data */ + mca_base_component_data_t base_data; + + /** Verbosity Level */ + int verbose; + /** Output Handle for opal_output */ + int output_handle; + /** Default Priority */ + int priority; +}; +typedef struct orte_sstore_base_component_2_0_0_t orte_sstore_base_component_2_0_0_t; +typedef struct orte_sstore_base_component_2_0_0_t orte_sstore_base_component_t; + +/** + * Structure for SSTORE modules + */ +struct orte_sstore_base_module_1_0_0_t { + /** Initialization Function */ + orte_sstore_base_module_init_fn_t sstore_init; + /** Finalization Function */ + orte_sstore_base_module_finalize_fn_t sstore_finalize; + + /** Request handle */ + orte_sstore_base_request_checkpoint_handle_fn_t request_checkpoint_handle; + orte_sstore_base_request_restart_handle_fn_t request_restart_handle; + orte_sstore_base_request_global_snapshot_data_fn_t request_global_snapshot_data; + orte_sstore_base_register_handle_fn_t register_handle; + + /** Get/Set 
Attributes */ + orte_sstore_base_get_attribute_fn_t get_attr; + orte_sstore_base_set_attribute_fn_t set_attr; + + /** Sync */ + orte_sstore_base_sync_fn_t sync; + + /** Remove */ + orte_sstore_base_remove_fn_t remove; + + /** Pack/Unpack Handle */ + orte_sstore_base_pack_fn_t pack_handle; + orte_sstore_base_unpack_fn_t unpack_handle; + + /** Launch Helpers */ + orte_sstore_base_fetch_app_deps_fn_t fetch_app_deps; + orte_sstore_base_wait_all_deps_fn_t wait_all_deps; +}; +typedef struct orte_sstore_base_module_1_0_0_t orte_sstore_base_module_1_0_0_t; +typedef struct orte_sstore_base_module_1_0_0_t orte_sstore_base_module_t; + +ORTE_DECLSPEC extern orte_sstore_base_module_t orte_sstore; + +/** + * Macro for use in components that are of type SSTORE + */ +#define ORTE_SSTORE_BASE_VERSION_2_0_0 \ + MCA_BASE_VERSION_2_0_0, \ + "sstore", 2, 0, 0 + +END_C_DECLS + +#endif /* MCA_SSTORE_H */ + diff --git a/orte/mca/sstore/stage/Makefile.am b/orte/mca/sstore/stage/Makefile.am new file mode 100644 index 0000000000..3c15cd3c33 --- /dev/null +++ b/orte/mca/sstore/stage/Makefile.am @@ -0,0 +1,40 @@ +# +# Copyright (c) 2010 The Trustees of Indiana University. +# All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +dist_pkgdata_DATA = help-orte-sstore-stage.txt + +sources = \ + sstore_stage.h \ + sstore_stage_component.c \ + sstore_stage_module.c \ + sstore_stage_global.c \ + sstore_stage_local.c \ + sstore_stage_app.c + +# Make the output library in this directory, and name it either +# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la +# (for static builds).
+ +if OMPI_BUILD_sstore_stage_DSO +component_noinst = +component_install = mca_sstore_stage.la +else +component_noinst = libmca_sstore_stage.la +component_install = +endif + +mcacomponentdir = $(pkglibdir) +mcacomponent_LTLIBRARIES = $(component_install) +mca_sstore_stage_la_SOURCES = $(sources) +mca_sstore_stage_la_LDFLAGS = -module -avoid-version + +noinst_LTLIBRARIES = $(component_noinst) +libmca_sstore_stage_la_SOURCES = $(sources) +libmca_sstore_stage_la_LDFLAGS = -module -avoid-version diff --git a/orte/mca/sstore/stage/configure.m4 b/orte/mca/sstore/stage/configure.m4 new file mode 100644 index 0000000000..c1604039fe --- /dev/null +++ b/orte/mca/sstore/stage/configure.m4 @@ -0,0 +1,20 @@ +# -*- shell-script -*- +# +# Copyright (c) 2010 The Trustees of Indiana University. +# All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +# MCA_sstore_stage_CONFIG([action-if-found], [action-if-not-found]) +# ----------------------------------------------------------- +AC_DEFUN([MCA_sstore_stage_CONFIG],[ + # If we don't want FT, don't compile this component + AS_IF([test "$opal_want_ft_cr" = "1"], + [$1], + [$2]) +])dnl diff --git a/orte/mca/sstore/stage/configure.params b/orte/mca/sstore/stage/configure.params new file mode 100644 index 0000000000..a6501a1b83 --- /dev/null +++ b/orte/mca/sstore/stage/configure.params @@ -0,0 +1,13 @@ +# -*- shell-script -*- +# +# Copyright (c) 2010 The Trustees of Indiana University. +# All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +PARAM_INIT_FILE=sstore_stage_component.c +PARAM_CONFIG_FILES="Makefile" diff --git a/orte/mca/sstore/stage/help-orte-sstore-stage.txt b/orte/mca/sstore/stage/help-orte-sstore-stage.txt new file mode 100644 index 0000000000..e0287991a4 --- /dev/null +++ b/orte/mca/sstore/stage/help-orte-sstore-stage.txt @@ -0,0 +1,26 @@ + -*- text -*- +# +# Copyright (c) 2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# +# This is the US/English general help file for ORTE SStore framework. +# +[fail_path_create] +Error: Failed to create the following directory. + Check to make sure this process/node can access the specified directory. +Process : %s +Node : %s +Directory: %s + +[caching_no_recovery] +Warning: Caching has been enabled, but ErrMgr recovery has not. + Node-local caching of local snapshots is only used when recovering + a failed job automatically using the ErrMgr recovery mechanism. + So this combination of options requires SStore to do extra work + from which it will receive no benefit. diff --git a/orte/mca/sstore/stage/sstore_stage.h b/orte/mca/sstore/stage/sstore_stage.h new file mode 100644 index 0000000000..a0a3930fc7 --- /dev/null +++ b/orte/mca/sstore/stage/sstore_stage.h @@ -0,0 +1,145 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved.
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/** + * @file + * + * STAGE SSTORE component + * + */ + +#ifndef MCA_SSTORE_STAGE_EXPORT_H +#define MCA_SSTORE_STAGE_EXPORT_H + +#include "orte_config.h" + +#include "opal/mca/mca.h" + +#include "orte/mca/sstore/sstore.h" + +BEGIN_C_DECLS + +typedef uint8_t orte_sstore_stage_cmd_flag_t; +#define ORTE_SSTORE_STAGE_CMD OPAL_UINT8 +#define ORTE_SSTORE_STAGE_PULL 1 +#define ORTE_SSTORE_STAGE_PUSH 2 +#define ORTE_SSTORE_STAGE_REMOVE 3 +#define ORTE_SSTORE_STAGE_DONE 4 + +#define ORTE_SSTORE_LOCAL_SNAPSHOT_DIR_NAME ("openmpi-local-snapshot") +#define ORTE_SSTORE_LOCAL_SNAPSHOT_STAGE_DIR_NAME ("stage") +#define ORTE_SSTORE_LOCAL_SNAPSHOT_RESTART_DIR_NAME ("restart") +#define ORTE_SSTORE_LOCAL_SNAPSHOT_CACHE_DIR_NAME ("cache") + + /* + * Local Component structures + */ + struct orte_sstore_stage_component_t { + /** Base SSTORE component */ + orte_sstore_base_component_t super; + }; + typedef struct orte_sstore_stage_component_t orte_sstore_stage_component_t; + ORTE_MODULE_DECLSPEC extern orte_sstore_stage_component_t mca_sstore_stage_component; + + extern char * orte_sstore_stage_local_snapshot_dir; + extern bool orte_sstore_stage_global_is_shared; + extern bool orte_sstore_stage_skip_filem; + extern bool orte_sstore_stage_enabled_caching; + extern bool orte_sstore_stage_enabled_compression; + extern int orte_sstore_stage_compress_delay; + extern int orte_sstore_stage_progress_meter; + + int orte_sstore_stage_component_query(mca_base_module_t **module, int *priority); + + /* + * Module functions + */ + int orte_sstore_stage_module_init(void); + int orte_sstore_stage_module_finalize(void); + + int orte_sstore_stage_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid); + int orte_sstore_stage_request_restart_handle(orte_sstore_base_handle_t *handle, char *basedir, char *ref, int seq, + orte_sstore_base_global_snapshot_info_t *snapshot); + int 
orte_sstore_stage_request_global_snapshot_data(orte_sstore_base_handle_t *handle, + orte_sstore_base_global_snapshot_info_t *snapshot); + int orte_sstore_stage_register(orte_sstore_base_handle_t handle); + + int orte_sstore_stage_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value); + int orte_sstore_stage_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value); + + int orte_sstore_stage_sync(orte_sstore_base_handle_t handle); + int orte_sstore_stage_remove(orte_sstore_base_handle_t handle); + + int orte_sstore_stage_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle); + int orte_sstore_stage_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle); + + int orte_sstore_stage_fetch_app_deps(orte_app_context_t *app); + int orte_sstore_stage_wait_all_deps(void); + + /* + * HNP functions + */ +int orte_sstore_stage_global_module_init(void); +int orte_sstore_stage_global_module_finalize(void); +int orte_sstore_stage_global_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid); +int orte_sstore_stage_global_request_global_snapshot_data(orte_sstore_base_handle_t *handle, + orte_sstore_base_global_snapshot_info_t *snapshot); +int orte_sstore_stage_global_register(orte_sstore_base_handle_t handle); +int orte_sstore_stage_global_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value); +int orte_sstore_stage_global_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value); +int orte_sstore_stage_global_sync(orte_sstore_base_handle_t handle); +int orte_sstore_stage_global_remove(orte_sstore_base_handle_t handle); +int orte_sstore_stage_global_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle); +int orte_sstore_stage_global_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle); + + /* 
+ * Orted functions + */ +int orte_sstore_stage_local_module_init(void); +int orte_sstore_stage_local_module_finalize(void); +int orte_sstore_stage_local_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid); +int orte_sstore_stage_local_register(orte_sstore_base_handle_t handle); +int orte_sstore_stage_local_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value); +int orte_sstore_stage_local_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value); +int orte_sstore_stage_local_sync(orte_sstore_base_handle_t handle); +int orte_sstore_stage_local_remove(orte_sstore_base_handle_t handle); +int orte_sstore_stage_local_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle); +int orte_sstore_stage_local_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle); +int orte_sstore_stage_local_fetch_app_deps(orte_app_context_t *app); +int orte_sstore_stage_local_wait_all_deps(void); + +void orte_sstore_stage_local_process_cmd(int fd, + short event, + void *cbdata); +int orte_sstore_stage_local_process_cmd_action(orte_process_name_t *sender, + orte_sstore_stage_cmd_flag_t command, + orte_sstore_base_handle_t loc_id, + opal_buffer_t* buffer); + /* + * Application functions + */ +int orte_sstore_stage_app_module_init(void); +int orte_sstore_stage_app_module_finalize(void); +int orte_sstore_stage_app_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid); +int orte_sstore_stage_app_register(orte_sstore_base_handle_t handle); +int orte_sstore_stage_app_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value); +int orte_sstore_stage_app_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value); +int orte_sstore_stage_app_sync(orte_sstore_base_handle_t handle); +int orte_sstore_stage_app_remove(orte_sstore_base_handle_t handle); +int 
orte_sstore_stage_app_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle); +int orte_sstore_stage_app_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle); + + /* + * Internal utility functions + */ + +END_C_DECLS + +#endif /* MCA_SSTORE_STAGE_EXPORT_H */ diff --git a/orte/mca/sstore/stage/sstore_stage_app.c b/orte/mca/sstore/stage/sstore_stage_app.c new file mode 100644 index 0000000000..00496a9d9c --- /dev/null +++ b/orte/mca/sstore/stage/sstore_stage_app.c @@ -0,0 +1,723 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + * + */ + +#include "orte_config.h" + +#ifdef HAVE_STRING_H +#include +#endif +#include +#include +#include +#include +#ifdef HAVE_UNISTD_H +#include +#endif /* HAVE_UNISTD_H */ + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" + +#include "opal/event/event.h" + +#include "orte/constants.h" +#include "orte/util/show_help.h" +#include "opal/util/argv.h" +#include "opal/util/output.h" +#include "opal/util/show_help.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" +#include "opal/util/os_dirpath.h" + +#include "opal/threads/mutex.h" +#include "opal/threads/condition.h" + +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/rml/rml_types.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "sstore_stage.h" + +/********** + * Object stuff + **********/ +struct orte_sstore_stage_app_snapshot_info_t { + /** List super object */ + opal_list_item_t super; + + /** */ + orte_sstore_base_handle_t id; + + /** Global Sequence Number */ + int seq_num; + + /** Global Reference Name */ + char * 
global_ref_name; + + /** Local Location (Absolute Path) */ + char * local_location; + + /** Metadata File Name (Absolute Path) */ + char *metadata_filename; + + /** Metadata File Descriptor */ + FILE *metadata; + + /** CRS Component used */ + char * crs_comp; + + /** Did this process skip the checkpoint? */ + bool ckpt_skipped; +}; +typedef struct orte_sstore_stage_app_snapshot_info_t orte_sstore_stage_app_snapshot_info_t; +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_stage_app_snapshot_info_t); + +void orte_sstore_stage_app_snapshot_info_construct(orte_sstore_stage_app_snapshot_info_t *info); +void orte_sstore_stage_app_snapshot_info_destruct( orte_sstore_stage_app_snapshot_info_t *info); + +OBJ_CLASS_INSTANCE(orte_sstore_stage_app_snapshot_info_t, + opal_list_item_t, + orte_sstore_stage_app_snapshot_info_construct, + orte_sstore_stage_app_snapshot_info_destruct); + + +/********** + * Local Function and Variable Declarations + **********/ +static orte_sstore_stage_app_snapshot_info_t *create_new_handle_info(orte_sstore_base_handle_t handle); +static orte_sstore_stage_app_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle); + +static int init_local_snapshot_directory(orte_sstore_stage_app_snapshot_info_t *handle_info); +static int pull_handle_info(orte_sstore_stage_app_snapshot_info_t *handle_info ); +static int push_handle_info(orte_sstore_stage_app_snapshot_info_t *handle_info ); + +static int metadata_open(orte_sstore_stage_app_snapshot_info_t * handle_info); +static int metadata_close(orte_sstore_stage_app_snapshot_info_t * handle_info); +static int metadata_write_str(orte_sstore_stage_app_snapshot_info_t * handle_info, char * key, char *value); +static int metadata_write_int(orte_sstore_stage_app_snapshot_info_t * handle_info, char *key, int value); +static int metadata_write_timestamp(orte_sstore_stage_app_snapshot_info_t * handle_info); + +static opal_list_t *active_handles = NULL; + +/********** + * Object stuff + **********/ +void 
orte_sstore_stage_app_snapshot_info_construct(orte_sstore_stage_app_snapshot_info_t *info) +{ + info->id = 0; + + info->seq_num = -1; + + info->global_ref_name = NULL; + info->local_location = NULL; + + info->metadata_filename = NULL; + info->metadata = NULL; + + info->crs_comp = NULL; + + info->ckpt_skipped = false; +} + +void orte_sstore_stage_app_snapshot_info_destruct( orte_sstore_stage_app_snapshot_info_t *info) +{ + info->id = 0; + info->seq_num = -1; + + if( NULL != info->global_ref_name ) { + free( info->global_ref_name ); + info->global_ref_name = NULL; + } + + if( NULL != info->local_location ) { + free( info->local_location ); + info->local_location = NULL; + } + + if( NULL != info->metadata_filename ) { + free( info->metadata_filename ) ; + info->metadata_filename = NULL; + } + + if( NULL != info->metadata ) { + fclose(info->metadata); + info->metadata = NULL; + } + + if( NULL != info->crs_comp ) { + free( info->crs_comp ); + info->crs_comp = NULL; + } + + info->ckpt_skipped = false; +} + +/****************** + * Local functions + ******************/ +int orte_sstore_stage_app_module_init(void) +{ + if( NULL == active_handles ) { + active_handles = OBJ_NEW(opal_list_t); + } + + return ORTE_SUCCESS; +} + +int orte_sstore_stage_app_module_finalize(void) +{ + if( NULL != active_handles ) { + OBJ_RELEASE(active_handles); + } + + return ORTE_SUCCESS; +} + +int orte_sstore_stage_app_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid) +{ + opal_output(0, "sstore:stage:(app): request_checkpoint_handle() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_stage_app_register(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_app_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(app): register(%d)", (int)handle)); + + /* + * Create a handle + */ + orte_sstore_handle_current = 
handle; + handle_info = create_new_handle_info(handle); + + /* + * Get basic information from Local SStore + */ + if( ORTE_SUCCESS != (ret = pull_handle_info(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Setup the storage directory + */ + if( ORTE_SUCCESS != (ret = init_local_snapshot_directory(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +int orte_sstore_stage_app_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value) +{ + int exit_status = ORTE_SUCCESS; + orte_sstore_stage_app_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(app): get_attr(%d)", key)); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Access metadata + */ + if( SSTORE_METADATA_GLOBAL_SNAP_SEQ == key ) { + asprintf(value, "%d", handle_info->seq_num); + } + else if( SSTORE_METADATA_LOCAL_SNAP_LOC == key) { + *value = strdup(handle_info->local_location); + } + else if( SSTORE_METADATA_LOCAL_SNAP_META == key ) { + *value = strdup(handle_info->metadata_filename); + } + else if( SSTORE_METADATA_GLOBAL_SNAP_REF == key ) { + *value = strdup(handle_info->global_ref_name); + } + else { + exit_status = ORTE_ERR_NOT_SUPPORTED; + goto cleanup; + } + + cleanup: + return exit_status; +} + +int orte_sstore_stage_app_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_app_snapshot_info_t *handle_info = NULL; + char *key_str = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(app): set_attr(%d = %s)", key, value)); + + if( NULL == value ) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + if( key >= SSTORE_METADATA_MAX ) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = 
ORTE_ERROR; + goto cleanup; + } + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Access metadata + */ + if( SSTORE_METADATA_LOCAL_CRS_COMP == key ) { + if( NULL != handle_info->crs_comp ) { + free(handle_info->crs_comp); + } + handle_info->crs_comp = strdup(value); + } + else if(SSTORE_METADATA_LOCAL_SKIP_CKPT == key ) { + handle_info->ckpt_skipped = true; + } + else if( SSTORE_METADATA_LOCAL_MKDIR == key || + SSTORE_METADATA_LOCAL_TOUCH == key ) { + orte_sstore_base_convert_key_to_string(key, &key_str); + if( ORTE_SUCCESS != (ret = metadata_write_str(handle_info, key_str, value))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + else { + exit_status = ORTE_ERROR; + goto cleanup; + } + + cleanup: + return exit_status; +} + +int orte_sstore_stage_app_sync(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_app_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(app): sync()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Finalize and close the metadata + */ + if( ORTE_SUCCESS != (ret = metadata_write_timestamp(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_close(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Push information to the Local coordinator + */ + if( ORTE_SUCCESS != (ret = push_handle_info(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + orte_sstore_handle_current = ORTE_SSTORE_HANDLE_INVALID; + + return exit_status; +} + +int orte_sstore_stage_app_remove(orte_sstore_base_handle_t handle) +{ + opal_output(0, "sstore:stage:(app): remove() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_stage_app_pack(orte_process_name_t* peer, 
opal_buffer_t* buffer, orte_sstore_base_handle_t handle) +{ + opal_output(0, "sstore:stage:(app): pack() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_stage_app_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle) +{ + opal_output(0, "sstore:stage:(app): unpack() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +/************************** + * Local functions + **************************/ +static orte_sstore_stage_app_snapshot_info_t *create_new_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_stage_app_snapshot_info_t *handle_info = NULL; + + handle_info = OBJ_NEW(orte_sstore_stage_app_snapshot_info_t); + + handle_info->id = handle; + + opal_list_append(active_handles, &(handle_info->super)); + + return handle_info; +} + +static orte_sstore_stage_app_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_stage_app_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_stage_app_snapshot_info_t*)item; + + if( handle_info->id == handle ) { + return handle_info; + } + } + + return NULL; +} + +static int pull_handle_info(orte_sstore_stage_app_snapshot_info_t *handle_info ) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t buffer; + orte_sstore_stage_cmd_flag_t command; + orte_std_cntr_t count; + orte_sstore_base_handle_t loc_id; + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + /* + * Ask the daemon to send us the info that we need + */ + command = ORTE_SSTORE_STAGE_PULL; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SSTORE_STAGE_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status 
= ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, &buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Receive the response + */ + OBJ_DESTRUCT(&buffer); + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(app): pull() from %s -> %s", + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + ORTE_NAME_PRINT(ORTE_PROC_MY_DAEMON))); + if( ORTE_SUCCESS != (ret = orte_rml.recv_buffer(ORTE_PROC_MY_DAEMON, + &buffer, + ORTE_RML_TAG_SSTORE_INTERNAL, + 0)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &command, &count, ORTE_SSTORE_STAGE_CMD))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &loc_id, &count, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + if( loc_id != handle_info->id ) { + ; /* JJH Big problem */ + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &(handle_info->seq_num), &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &(handle_info->global_ref_name), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &(handle_info->local_location), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(&buffer, &(handle_info->metadata_filename), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} + +static int push_handle_info(orte_sstore_stage_app_snapshot_info_t 
*handle_info ) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t buffer; + orte_sstore_stage_cmd_flag_t command; + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + command = ORTE_SSTORE_STAGE_PUSH; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SSTORE_STAGE_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->ckpt_skipped), 1, OPAL_BOOL )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( !handle_info->ckpt_skipped ) { + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->crs_comp), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_DAEMON, &buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} + +static int init_local_snapshot_directory(orte_sstore_stage_app_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + mode_t my_mode = S_IRWXU; + + /* + * Make the snapshot directory from the uniq_global_snapshot_name + */ + if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(handle_info->local_location, my_mode)) ) { + opal_show_help("help-orte-sstore-stage.txt", "fail_path_create", true, + ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), + orte_process_info.nodename, + handle_info->local_location); + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Open up the metadata file + */ + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Add a timestamp and the PID of this process + */ + if( ORTE_SUCCESS != (ret = 
metadata_write_timestamp(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, SSTORE_METADATA_LOCAL_PID_STR, (int)getpid())) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_close(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + + +/************************** + * Metadata functions + **************************/ +static int metadata_open(orte_sstore_stage_app_snapshot_info_t * handle_info) +{ + /* If already open, then just return */ + if( NULL != handle_info->metadata ) { + return ORTE_SUCCESS; + } + + if (NULL == (handle_info->metadata = fopen(handle_info->metadata_filename, "a")) ) { + opal_output(orte_sstore_base_output, + "sstore:stage:(app): metadata_open() Unable to open the file (%s)\n", + handle_info->metadata_filename); + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + return ORTE_SUCCESS; +} + +static int metadata_close(orte_sstore_stage_app_snapshot_info_t * handle_info) +{ + /* If already closed, then just return */ + if( NULL == handle_info->metadata ) { + return ORTE_SUCCESS; + } + + fclose(handle_info->metadata); + handle_info->metadata = NULL; + + return ORTE_SUCCESS; +} + +static int metadata_write_str(orte_sstore_stage_app_snapshot_info_t * handle_info, char *key, char *value) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + fprintf(handle_info->metadata, "%s%s\n", key, value); + + cleanup: + /* Always close the metadata file after writing. If a checkpoint is taken + * while it is open, the CRS may later try to restore the FD, which will + * likely fail if the snapshot has moved */ + if( NULL != handle_info->metadata ) {
fclose(handle_info->metadata); + handle_info->metadata = NULL; + } + + return exit_status; +} + +static int metadata_write_int(orte_sstore_stage_app_snapshot_info_t * handle_info, char *key, int value) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + fprintf(handle_info->metadata, "%s%d\n", key, value); + + cleanup: + return exit_status; +} + +static int metadata_write_timestamp(orte_sstore_stage_app_snapshot_info_t * handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + time_t timestamp; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + timestamp = time(NULL); + fprintf(handle_info->metadata, "%s%s", SSTORE_METADATA_INTERNAL_TIME_STR, ctime(&timestamp)); + + cleanup: + return exit_status; +} diff --git a/orte/mca/sstore/stage/sstore_stage_component.c b/orte/mca/sstore/stage/sstore_stage_component.c new file mode 100644 index 0000000000..da4a617f14 --- /dev/null +++ b/orte/mca/sstore/stage/sstore_stage_component.c @@ -0,0 +1,235 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved.
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +#include "orte_config.h" +#include "opal/util/output.h" +#include "orte/constants.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" +#include "sstore_stage.h" + +/* + * Public string for version number + */ +const char *orte_sstore_stage_component_version_string = + "ORTE SSTORE stage MCA component version " ORTE_VERSION; + +/* + * Local functionality + */ +static int sstore_stage_open(void); +static int sstore_stage_close(void); + +/* + * Instantiate the public struct with all of our public information + * and pointers to our public functions in it + */ +orte_sstore_stage_component_t mca_sstore_stage_component = { + /* First do the base component stuff */ + { + /* Handle the general mca_component_t struct containing + * meta information about the component itself + */ + { + ORTE_SSTORE_BASE_VERSION_2_0_0, + /* Component name and version */ + "stage", + ORTE_MAJOR_VERSION, + ORTE_MINOR_VERSION, + ORTE_RELEASE_VERSION, + + /* Component open, close, and query functions */ + sstore_stage_open, + sstore_stage_close, + orte_sstore_stage_component_query + }, + { + /* The component is checkpoint ready */ + MCA_BASE_METADATA_PARAM_CHECKPOINT + }, + + /* Verbosity level */ + 0, + /* opal_output handler */ + -1, + /* Default priority */ + 10 + }, +}; + +char * orte_sstore_stage_local_snapshot_dir = NULL; +bool orte_sstore_stage_global_is_shared = false; +bool orte_sstore_stage_skip_filem = false; +bool orte_sstore_stage_enabled_caching = false; +bool orte_sstore_stage_enabled_compression = false; +int orte_sstore_stage_compress_delay = 0; +int orte_sstore_stage_progress_meter = 0; + +static int sstore_stage_open(void) +{ + int mca_index, value; + + /* + * The local directory to use when staging checkpoints back to central storage + */ + mca_index = mca_base_param_reg_string(&mca_sstore_stage_component.super.base_version, + "local_snapshot_dir", + "The temporary base
directory to use when storing local snapshots before they are moved.", + true, false, + opal_tmp_directory(), + &orte_sstore_stage_local_snapshot_dir); + mca_base_param_reg_syn_name(mca_index, "crs", "base_snapshot_dir", true); + + /* + * If the global storage is on a shared file system that all nodes can + * access, then we pass this hint on to FileM. + */ + mca_index = mca_base_param_reg_int(&mca_sstore_stage_component.super.base_version, + "global_is_shared", + "If the global_snapshot_dir is on a shared file system that all nodes can access, " + "then the checkpoint files can be copied more efficiently when FileM is used." + " [Default = disabled]", + false, false, + 0, + &value); + mca_base_param_reg_syn_name(mca_index, "snapc", "base_global_shared", true); + + orte_sstore_stage_global_is_shared = OPAL_INT_TO_BOOL(value); + + /* + * Debugging option to skip the filem step + * Warning: Will not produce a usable global snapshot + */ + mca_index = mca_base_param_reg_int(&mca_sstore_stage_component.super.base_version, + "skip_filem", + "Not for general use! For debugging only! Pretend to move files. [Default = disabled]", + false, false, + 0, + &value); + mca_base_param_reg_syn_name(mca_index, "snapc", "base_skip_filem", true); + + orte_sstore_stage_skip_filem = OPAL_INT_TO_BOOL(value); + + /* + * Maintain a local cache of checkpoints taken, so that automatic recovery + * does not require a transfer from central storage. + */ + mca_index = mca_base_param_reg_int(&mca_sstore_stage_component.super.base_version, + "caching", + "Maintain a node-local cache of the last checkpoint. [Default = disabled]", + false, false, + 0, + &value); + orte_sstore_stage_enabled_caching = OPAL_INT_TO_BOOL(value); + + /* + * Compress checkpoints before/after transfer + */ + mca_index = mca_base_param_reg_int(&mca_sstore_stage_component.super.base_version, + "compress", + "Compress local snapshots.
[Default = disabled]", + false, false, + 0, + &value); + orte_sstore_stage_enabled_compression = OPAL_INT_TO_BOOL(value); + + /* + * Number of seconds to delay the start of compression when sync'ing + */ + mca_index = mca_base_param_reg_int(&mca_sstore_stage_component.super.base_version, + "compress_delay", + "Seconds to delay the start of compression on sync()" + " [Default = 0]", + false, false, + 0, + &value); + orte_sstore_stage_compress_delay = value; + + /* + * A progress meter + */ + mca_index = mca_base_param_reg_int(&mca_sstore_stage_component.super.base_version, + "progress_meter", + "Display progress every X percent completed. [Default = 0/off]", + false, false, + 0, + &value); + orte_sstore_stage_progress_meter = (value % 101); + + /* + * Priority + */ + mca_base_param_reg_int(&mca_sstore_stage_component.super.base_version, + "priority", + "Priority of the SSTORE stage component", + false, false, + mca_sstore_stage_component.super.priority, + &mca_sstore_stage_component.super.priority); + /* + * Verbose Level + */ + mca_base_param_reg_int(&mca_sstore_stage_component.super.base_version, + "verbose", + "Verbose level for the SSTORE stage component", + false, false, + mca_sstore_stage_component.super.verbose, + &mca_sstore_stage_component.super.verbose); + /* If there is a custom verbose level for this component, then use it; + * otherwise take our parent's level and output channel + */ + if ( 0 != mca_sstore_stage_component.super.verbose) { + mca_sstore_stage_component.super.output_handle = opal_output_open(NULL); + opal_output_set_verbosity(mca_sstore_stage_component.super.output_handle, + mca_sstore_stage_component.super.verbose); + } else { + mca_sstore_stage_component.super.output_handle = orte_sstore_base_output; + } + + /* + * Debug Output + */ + opal_output_verbose(10, mca_sstore_stage_component.super.output_handle, + "sstore:stage: open()"); + opal_output_verbose(20, mca_sstore_stage_component.super.output_handle, + 
"sstore:stage: open: priority =
%d", + mca_sstore_stage_component.super.priority); + opal_output_verbose(20, mca_sstore_stage_component.super.output_handle, + "sstore:stage: open: verbosity = %d", + mca_sstore_stage_component.super.verbose); + opal_output_verbose(20, mca_sstore_stage_component.super.output_handle, + "sstore:stage: open: Local snapshot directory = %s", + orte_sstore_stage_local_snapshot_dir); + opal_output_verbose(20, mca_sstore_stage_component.super.output_handle, + "sstore:stage: open: Is Global dir. shared = %s", + (orte_sstore_stage_global_is_shared ? "True" : "False")); + opal_output_verbose(20, mca_sstore_stage_component.super.output_handle, + "sstore:stage: open: Node Local Caching = %s", + (orte_sstore_stage_enabled_caching ? "Enabled" : "Disabled")); + opal_output_verbose(20, mca_sstore_stage_component.super.output_handle, + "sstore:stage: open: Compression = %s", + (orte_sstore_stage_enabled_compression ? "Enabled" : "Disabled")); + opal_output_verbose(20, mca_sstore_stage_component.super.output_handle, + "sstore:stage: open: Compression Delay = %d", + orte_sstore_stage_compress_delay); + opal_output_verbose(20, mca_sstore_stage_component.super.output_handle, + "sstore:stage: open: Skip FileM (Debug Only) = %s", + (orte_sstore_stage_skip_filem ? "True" : "False")); + + return ORTE_SUCCESS; +} + +static int sstore_stage_close(void) +{ + opal_output_verbose(10, mca_sstore_stage_component.super.output_handle, + "sstore:stage: close()"); + + return ORTE_SUCCESS; +} diff --git a/orte/mca/sstore/stage/sstore_stage_global.c b/orte/mca/sstore/stage/sstore_stage_global.c new file mode 100644 index 0000000000..f0a7b9b9a9 --- /dev/null +++ b/orte/mca/sstore/stage/sstore_stage_global.c @@ -0,0 +1,1763 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + * + */ + +#include "orte_config.h" + +#ifdef HAVE_STRING_H +#include <string.h> +#endif +#include +#include +#include +#include +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif /* HAVE_UNISTD_H */ + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" + +#include "opal/event/event.h" + +#include "orte/constants.h" +#include "opal/util/argv.h" +#include "opal/util/output.h" +#include "opal/util/show_help.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" +#include "opal/util/os_dirpath.h" +#include "opal/util/opal_getcwd.h" + +#include "opal/threads/mutex.h" +#include "opal/threads/condition.h" + +#include "orte/util/show_help.h" +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "orte/runtime/orte_wait.h" + +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/errmgr/base/base.h" +#include "orte/mca/errmgr/base/errmgr_private.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/filem/filem.h" +#include "orte/mca/grpcomm/grpcomm.h" +#include "orte/mca/snapc/snapc.h" +#include "orte/mca/snapc/base/base.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "sstore_stage.h" + +#define SSTORE_HANDLE_TYPE_NONE 0 +#define SSTORE_HANDLE_TYPE_CKPT 1 +#define SSTORE_HANDLE_TYPE_RESTART 2 + +#define SSTORE_GLOBAL_NONE 0 +#define SSTORE_GLOBAL_ERROR 1 +#define SSTORE_GLOBAL_INIT 2 +#define SSTORE_GLOBAL_REG 3 +#define SSTORE_GLOBAL_SYNCING 4 +#define SSTORE_GLOBAL_SYNCED 5 + +/********** + * Object Stuff + **********/ +struct orte_sstore_stage_global_snapshot_info_t { + /** List super object */ + opal_list_item_t super; + + /** Handle ID */ + orte_sstore_base_handle_t id; + + /** Job ID */ + orte_jobid_t jobid; + + /** State */ + int state; + + /** Handle type */ + int handle_type; + + /** Sequence Number */ + int seq_num; + + /** 
Reference Name */ + char * ref_name; + + /** Local Location (Relative Path to base_location) */ + char * local_location; + + /** Application location format (Global) */ + char * app_global_location_fmt; + + /** Application location format (Local) */ + char * app_local_location_fmt; + + /** Application location format (Local Cache) */ + char * app_local_cache_location_fmt; + + /** Base location */ + char * base_location; + + /** Metadata File Name */ + char *metadata_filename; + + /** Metadata File Stream */ + FILE *metadata; + + /** Num procs reported as locally synced */ + int num_procs_synced; + + /** Num procs reported as done */ + int num_procs_done; + + /** Num procs total in job */ + int num_procs_total; + + /** List of FileM Requests to wait upon */ + opal_list_t *filem_requests; + + /** Is this checkpoint representing a migration? */ + bool migrating; + + /** JJH: Assume all processes are compressed the same way */ + char * compress_comp; + char * compress_postfix; + + /** Progress Meter */ + double last_progress_report; +}; +typedef struct orte_sstore_stage_global_snapshot_info_t orte_sstore_stage_global_snapshot_info_t; +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_stage_global_snapshot_info_t); + +void orte_sstore_stage_global_snapshot_info_construct(orte_sstore_stage_global_snapshot_info_t *info); +void orte_sstore_stage_global_snapshot_info_destruct( orte_sstore_stage_global_snapshot_info_t *info); + +OBJ_CLASS_INSTANCE(orte_sstore_stage_global_snapshot_info_t, + opal_list_item_t, + orte_sstore_stage_global_snapshot_info_construct, + orte_sstore_stage_global_snapshot_info_destruct); + + +/********** + * Local Function and Variable Declarations + **********/ +static bool is_global_listener_active = false; +static int sstore_stage_global_start_listener(void); +static int sstore_stage_global_stop_listener(void); +static void sstore_stage_global_recv(int status, + orte_process_name_t* sender, + opal_buffer_t* buffer, + orte_rml_tag_t tag, + void* cbdata);
+static void sstore_stage_global_process_cmd(int fd, + short event, + void *cbdata); +static int process_local_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_global_snapshot_info_t *handle_info); +static int process_local_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_global_snapshot_info_t *handle_info); +static int process_local_done(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_global_snapshot_info_t *handle_info); +static int xcast_remove_all(orte_sstore_stage_global_snapshot_info_t *handle_info); + +static orte_sstore_stage_global_snapshot_info_t *create_new_handle_info(int seq, int type, orte_jobid_t jobid); +static orte_sstore_stage_global_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle); + +static int metadata_open(orte_sstore_stage_global_snapshot_info_t * handle_info); +static int metadata_close(orte_sstore_stage_global_snapshot_info_t * handle_info); +static int metadata_write_int(orte_sstore_stage_global_snapshot_info_t * handle_info, char * key, int value); +static int metadata_write_str(orte_sstore_stage_global_snapshot_info_t * handle_info, char * key, char *value); +static int metadata_write_timestamp(orte_sstore_stage_global_snapshot_info_t * handle_info); + +static int init_global_snapshot_directory(orte_sstore_stage_global_snapshot_info_t *handle_info); +static int stage_snapshot_sort_compare_fn(opal_list_item_t **a, + opal_list_item_t **b); +static int orte_sstore_stage_extract_global_metadata(orte_sstore_stage_global_snapshot_info_t * handle_info, + orte_sstore_base_global_snapshot_info_t *global_snapshot); + +static int wait_all_filem(orte_sstore_stage_global_snapshot_info_t *handle_info); +static void sync_global_dir(orte_sstore_stage_global_snapshot_info_t *handle_info); + +static int next_handle_id = 1; +static opal_list_t *active_handles = NULL; + +/* + * Progress + */ +static void 
sstore_stage_report_progress(orte_sstore_stage_global_snapshot_info_t *handle_info); + +#define SSTORE_STAGE_REPORT_PROGRESS(handle_info) \ + { \ + if(OPAL_UNLIKELY(orte_sstore_stage_progress_meter > 0)) { \ + sstore_stage_report_progress(handle_info); \ + } \ + } + +/********** + * Object stuff + **********/ +void orte_sstore_stage_global_snapshot_info_construct(orte_sstore_stage_global_snapshot_info_t *info) +{ + info->id = next_handle_id; + next_handle_id++; + + info->jobid = ORTE_JOBID_INVALID; + + info->state = SSTORE_GLOBAL_NONE; + + info->handle_type = SSTORE_HANDLE_TYPE_NONE; + + info->seq_num = -1; + + info->base_location = strdup(orte_sstore_base_global_snapshot_dir); + + info->ref_name = NULL; + info->local_location = NULL; + info->app_global_location_fmt = NULL; + info->app_local_location_fmt = NULL; + info->app_local_cache_location_fmt = NULL; + + info->metadata_filename = NULL; + info->metadata = NULL; + + info->filem_requests = OBJ_NEW(opal_list_t); + + info->num_procs_synced = 0; + info->num_procs_done = 0; + info->num_procs_total = 0; + + info->migrating = false; + + info->compress_comp = NULL; + info->compress_postfix = NULL; + + info->last_progress_report = 0.0; +} + +void orte_sstore_stage_global_snapshot_info_destruct( orte_sstore_stage_global_snapshot_info_t *info) +{ + info->id = 0; + info->seq_num = -1; + + info->jobid = ORTE_JOBID_INVALID; + + info->state = SSTORE_GLOBAL_NONE; + + info->handle_type = SSTORE_HANDLE_TYPE_NONE; + + if( NULL != info->ref_name ) { + free( info->ref_name ); + info->ref_name = NULL; + } + + if( NULL != info->local_location ) { + free( info->local_location ); + info->local_location = NULL; + } + + if( NULL != info->app_global_location_fmt ) { + free( info->app_global_location_fmt ); + info->app_global_location_fmt = NULL; + } + + if( NULL != info->app_local_location_fmt ) { + free( info->app_local_location_fmt ); + info->app_local_location_fmt = NULL; + } + + if( NULL != info->app_local_cache_location_fmt ) { + 
free( info->app_local_cache_location_fmt ); + info->app_local_cache_location_fmt = NULL; + } + + if( NULL != info->base_location ) { + free( info->base_location ); + info->base_location = NULL; + } + + if( NULL != info->metadata_filename ) { + free( info->metadata_filename ) ; + info->metadata_filename = NULL; + } + + if( NULL != info->metadata ) { + fclose(info->metadata); + info->metadata = NULL; + } + + if( NULL != info->filem_requests ) { + OBJ_RELEASE(info->filem_requests); + info->filem_requests = NULL; + } + + info->num_procs_synced = 0; + info->num_procs_done = 0; + info->num_procs_total = 0; + + info->migrating = false; + + if( NULL != info->compress_comp ) { + free(info->compress_comp); + info->compress_comp = NULL; + } + + if( NULL != info->compress_postfix ) { + free(info->compress_postfix); + info->compress_postfix = NULL; + } + + info->last_progress_report = 0.0; +} + +/****************** + * Local functions + ******************/ +int orte_sstore_stage_global_module_init(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if( NULL == active_handles ) { + active_handles = OBJ_NEW(opal_list_t); + } + + /* + * If user has not enabled recovery, but enabled Caching then caching does + * not benefit the job. Continue using it, but warn the user. 
+ */ + if( orte_sstore_stage_enabled_caching && !orte_enable_recovery ) { + opal_show_help("help-orte-sstore-stage.txt", "caching_no_recovery", true); + } + + /* + * Setup a listener for the HNP/Apps + */ + if( ORTE_SUCCESS != (ret = sstore_stage_global_start_listener()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + exit_status = orte_sstore_stage_local_module_init(); + + cleanup: + return exit_status; +} + +int orte_sstore_stage_global_module_finalize(void) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + bool done = false; + int cur_time = 0, max_time = 120; + + /* + * Wait for all active transfers to finish + */ + done = false; + while( 0 < opal_list_get_size(active_handles) && !done ) { + done = true; + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_stage_global_snapshot_info_t*)item; + if( SSTORE_GLOBAL_SYNCED != handle_info->state && + SSTORE_GLOBAL_NONE != handle_info->state ) { + done = false; + break; + } + } + if( done ) { + break; + } + else { + if( cur_time != 0 && cur_time % 30 == 0 ) { + opal_output(0, "---> Waiting for sync(): %3d / %3d\n", + cur_time, max_time); + } + + opal_progress(); + if( cur_time >= max_time ) { + break; + } else { + sleep(1); + } + cur_time++; + } + } + + exit_status = orte_sstore_stage_local_module_finalize(); + + if( NULL != active_handles ) { + OBJ_RELEASE(active_handles); + } + + /* + * Shutdown the listener for the HNP/Apps + */ + if( ORTE_SUCCESS != (ret = sstore_stage_global_stop_listener()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +int orte_sstore_stage_global_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid) +{ + int ret, exit_status = ORTE_SUCCESS; + 
orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): request_checkpoint_handle()")); + + /* + * Construct a handle + * - Associate all of the necessary information + */ + handle_info = create_new_handle_info(seq, SSTORE_HANDLE_TYPE_CKPT, jobid); + + /* + * Create the global checkpoint directory + */ + if( ORTE_SUCCESS != (ret = init_global_snapshot_directory(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Return the handle + */ + *handle = handle_info->id; + + cleanup: + return exit_status; +} + +int orte_sstore_stage_global_request_global_snapshot_data(orte_sstore_base_handle_t *handle, + orte_sstore_base_global_snapshot_info_t *snapshot) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): request_global_snapshot_data()")); + + /* + * Lookup the handle (if NULL, use last stable) + */ + if( NULL != handle ) { + handle_info = find_handle_info(*handle); + snapshot->ss_handle = *handle; + } else { + handle_info = find_handle_info(orte_sstore_handle_last_stable); + snapshot->ss_handle = orte_sstore_handle_last_stable; + } + + /* + * Construct the snapshot from local data, and metadata file + */ + snapshot->seq_num = handle_info->seq_num; + snapshot->reference = strdup(handle_info->ref_name); + snapshot->basedir = strdup(handle_info->base_location); + snapshot->metadata_filename = strdup(handle_info->metadata_filename); + + /* If this is the current checkpoint, pull data from local cache */ + if( orte_sstore_handle_current == snapshot->ss_handle ) { + if( ORTE_SUCCESS != (ret = orte_sstore_stage_extract_global_metadata(handle_info, snapshot)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + /* Otherwise, pull from metadata */ + else { + 
if( ORTE_SUCCESS != (ret = orte_sstore_base_extract_global_metadata(snapshot)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + opal_list_sort(&snapshot->local_snapshots, stage_snapshot_sort_compare_fn); + + cleanup: + return exit_status; +} + +int orte_sstore_stage_global_register(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): register(%d) - Global", handle)); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + if( SSTORE_GLOBAL_REG != handle_info->state ) { + handle_info->state = SSTORE_GLOBAL_REG; + } else { + return orte_sstore_stage_local_register(handle); + } + + orte_sstore_handle_current = handle; + + /* + * Associate the metadata + */ + if( handle_info->migrating ) { + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, + SSTORE_METADATA_INTERNAL_MIG_SEQ_STR, + handle_info->seq_num)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } else { + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, + SSTORE_METADATA_GLOBAL_SNAP_SEQ_STR, + handle_info->seq_num)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + if( ORTE_SUCCESS != (ret = metadata_write_str(handle_info, + SSTORE_METADATA_LOCAL_SNAP_REF_FMT_STR, + orte_sstore_base_local_snapshot_fmt)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_write_timestamp(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +int orte_sstore_stage_global_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value) +{ + int exit_status = ORTE_SUCCESS; + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, 
mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): get_attr()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Access metadata + */ + /* Used by snapc */ + if( SSTORE_METADATA_GLOBAL_SNAP_REF == key ) { + *value = strdup(handle_info->ref_name); + } + /* Used by snapc */ + else if( SSTORE_METADATA_GLOBAL_SNAP_SEQ == key ) { + asprintf(value, "%d", handle_info->seq_num); + } + /* Used by orte-restart and RecoS and snapc (kinda) */ + else if( SSTORE_METADATA_LOCAL_SNAP_LOC == key ) { + asprintf(value, "%s/%s/%d", + handle_info->base_location, + handle_info->ref_name, + handle_info->seq_num); + } + /* Used by orte-restart and RecoS */ + else if( SSTORE_METADATA_LOCAL_SNAP_REF_FMT == key ) { + *value = strdup(orte_sstore_base_local_snapshot_fmt); + } + /* Used by orte-restart and RecoS */ + else if( SSTORE_METADATA_LOCAL_SNAP_REF_LOC_FMT == key ) { + asprintf(value, "%s/%s/%d/%s", + handle_info->base_location, + handle_info->ref_name, + handle_info->seq_num, + orte_sstore_base_local_snapshot_fmt); + } + else { + exit_status = ORTE_ERR_NOT_SUPPORTED; + } + + return exit_status; +} + +int orte_sstore_stage_global_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + char *key_str = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): set_attr()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Process key (Access metadata) + */ + if( key == SSTORE_METADATA_GLOBAL_MIGRATING ) { + handle_info->migrating = true; + } + else { + orte_sstore_base_convert_key_to_string(key, &key_str); + if( NULL == key_str ) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = metadata_write_str(handle_info, key_str, value))) { + 
ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + cleanup: + if( NULL != key_str ) { + free(key_str); + key_str = NULL; + } + + return exit_status; +} + +int orte_sstore_stage_global_sync(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): sync()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + if( SSTORE_GLOBAL_SYNCING != handle_info->state ) { + handle_info->state = SSTORE_GLOBAL_SYNCING; + if( ORTE_SNAPC_LOCAL_COORD_TYPE == (orte_snapc_coord_type & ORTE_SNAPC_LOCAL_COORD_TYPE) ) { + return orte_sstore_stage_local_sync(handle); + } + } + + /* + * Wait for all the processes to report in before waiting on all the requests + */ + while(handle_info->num_procs_synced < handle_info->num_procs_total) { + opal_progress(); + } + + /* + * Synchronize all of the files + * Wait on FileM operations + */ + if( !orte_sstore_stage_skip_filem ) { + if( ORTE_SUCCESS != (ret = wait_all_filem(handle_info))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + /* + * Finalize and close the metadata + */ + if( ORTE_SUCCESS != (ret = metadata_write_timestamp(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( handle_info->migrating ) { + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, + SSTORE_METADATA_INTERNAL_DONE_MIG_SEQ_STR, + handle_info->seq_num)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } else { + if( ORTE_SUCCESS != (ret = metadata_write_int(handle_info, + SSTORE_METADATA_INTERNAL_DONE_SEQ_STR, + handle_info->seq_num)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + if( ORTE_SUCCESS != (ret = metadata_close(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* JJH: We should lock 
this var! */ + if( !handle_info->migrating ) { + orte_sstore_base_is_checkpoint_available = true; + orte_sstore_handle_last_stable = orte_sstore_handle_current; + } + + handle_info->state = SSTORE_GLOBAL_SYNCED; + + cleanup: + return exit_status; +} + +static void sync_global_dir(orte_sstore_stage_global_snapshot_info_t *handle_info) +{ + opal_list_item_t* item = NULL, *f_item = NULL; + orte_filem_base_request_t *filem_request = NULL; + orte_filem_base_file_set_t * f_set = NULL; + char * fs_str = NULL; + char cwd[OPAL_PATH_MAX]; + + opal_getcwd(cwd, OPAL_PATH_MAX); + + /* + * Sync the Sequence num dir + */ + asprintf(&fs_str, "%s/%s/%d", + handle_info->base_location, + handle_info->ref_name, + handle_info->seq_num); + OPAL_OUTPUT_VERBOSE((20, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): sync_dir(): Sync'ing on %s", + fs_str)); + if( 0 != chdir(fs_str) ) { + opal_output(0, "sstore:stage:(global): Failed to chdir(%s)", + fs_str); + goto cleanup; + } + system("sync ; sync ; ls > /dev/null"); + + /* + * Sync each of the local snapshots + * if compressing, then this is already covered above + */ + if( orte_sstore_stage_enabled_compression ) { + goto cleanup; + } + + for(f_item = opal_list_get_first(handle_info->filem_requests); + f_item != opal_list_get_end(handle_info->filem_requests); + f_item = opal_list_get_next(f_item) ) { + filem_request = (orte_filem_base_request_t *)f_item; + + for(item = opal_list_get_first(&(filem_request->file_sets)); + item != opal_list_get_end(&(filem_request->file_sets)); + item = opal_list_get_next(item) ) { + f_set = (orte_filem_base_file_set_t *) item; + + if( NULL != fs_str ) { + free(fs_str); + fs_str = NULL; + } + + if( ORTE_FILEM_TYPE_FILE != f_set->target_flag ) { + OPAL_OUTPUT_VERBOSE((20, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): sync_dir(): Sync'ing on %s", + f_set->local_target)); + if( 0 != chdir(f_set->local_target) ) { + opal_output(0, "sstore:stage:(global): 
Failed to chdir(%s)", + f_set->local_target); + } else { + system("sync ; sync "); + } + } + } + } + + cleanup: + chdir(cwd); + + if( NULL != fs_str ) { + free(fs_str); + fs_str = NULL; + } + + return; +} + +int orte_sstore_stage_global_remove(orte_sstore_base_handle_t handle) +{ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): remove()")); + + /* + * Lookup the handle + */ + + return ORTE_SUCCESS; +} + +int orte_sstore_stage_global_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): pack()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Pack the handle ID + */ + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &handle, 1, ORTE_SSTORE_HANDLE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Pack any metadata + */ + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(handle_info->seq_num), 1, OPAL_INT )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(handle_info->ref_name), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(handle_info->app_local_location_fmt), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( orte_sstore_stage_enabled_caching ) { + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(handle_info->app_local_cache_location_fmt), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(handle_info->migrating), 1, OPAL_BOOL )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + 
cleanup: + return exit_status; +} + +int orte_sstore_stage_global_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle) +{ + int ret, exit_status = ORTE_SUCCESS; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): unpack()")); + + /* + * Unpack the handle id + */ + if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_JOBID, + ORTE_PROC_MY_NAME, + peer)) { + /* + * Defer to the orted version, so that any local application processes get updated too + */ + if( ORTE_SUCCESS != (ret = orte_sstore_stage_local_unpack(peer, buffer, handle)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + cleanup: + return exit_status; +} + +/************************** + * Local functions + **************************/ +static orte_sstore_stage_global_snapshot_info_t *create_new_handle_info(int seq, int type, orte_jobid_t jobid) +{ + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + orte_job_t *jdata = NULL; + + handle_info = OBJ_NEW(orte_sstore_stage_global_snapshot_info_t); + + handle_info->jobid = jobid; + + handle_info->state = SSTORE_GLOBAL_INIT; + + handle_info->handle_type = type; + + handle_info->seq_num = seq; + + orte_sstore_base_get_global_snapshot_ref(&(handle_info->ref_name), getpid()); + + asprintf(&(handle_info->local_location), "%s/%d", + handle_info->ref_name, handle_info->seq_num); + + /* This is used by the application to establish the local directory */ + asprintf(&(handle_info->app_local_location_fmt), "%s/%s/%s/%s", + orte_sstore_stage_local_snapshot_dir, + ORTE_SSTORE_LOCAL_SNAPSHOT_DIR_NAME, + ORTE_SSTORE_LOCAL_SNAPSHOT_STAGE_DIR_NAME, + orte_sstore_base_local_snapshot_fmt); + + if( orte_sstore_stage_enabled_caching ) { + asprintf(&(handle_info->app_local_cache_location_fmt), "%s/%s/%s/%d/%s", + orte_sstore_stage_local_snapshot_dir, + ORTE_SSTORE_LOCAL_SNAPSHOT_DIR_NAME, + ORTE_SSTORE_LOCAL_SNAPSHOT_CACHE_DIR_NAME, +
handle_info->seq_num, + orte_sstore_base_local_snapshot_fmt); + } + + /* This is used by the HNP to remember where it should place each process */ + asprintf(&(handle_info->app_global_location_fmt), "%s/%s/%s", + handle_info->base_location, + handle_info->local_location, + orte_sstore_base_local_snapshot_fmt); + + asprintf(&(handle_info->metadata_filename), "%s/%s/%s", + handle_info->base_location, + handle_info->ref_name, + orte_sstore_base_global_metadata_filename); + + jdata = orte_get_job_data_object(handle_info->jobid); + handle_info->num_procs_total = (int)jdata->num_procs; + + opal_list_append(active_handles, &(handle_info->super)); + + return handle_info; +} + +static orte_sstore_stage_global_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_stage_global_snapshot_info_t*)item; + + if( handle_info->id == handle ) { + return handle_info; + } + } + + return NULL; +} + +static int sstore_stage_global_start_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if( is_global_listener_active ) { + return ORTE_SUCCESS; + } + + if (ORTE_SUCCESS != (ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_SSTORE_INTERNAL, + ORTE_RML_PERSISTENT, + sstore_stage_global_recv, + NULL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + is_global_listener_active = true; + + cleanup: + return exit_status; +} + +static int sstore_stage_global_stop_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if (ORTE_SUCCESS != (ret = orte_rml.recv_cancel(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_SSTORE_INTERNAL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + is_global_listener_active = false; + + cleanup: + return exit_status; +} + +static void 
sstore_stage_global_recv(int status, + orte_process_name_t* sender, + opal_buffer_t* buffer, + orte_rml_tag_t tag, + void* cbdata) +{ + if( ORTE_RML_TAG_SSTORE_INTERNAL != tag ) { + return; + } + + ORTE_MESSAGE_EVENT(sender, buffer, tag, sstore_stage_global_process_cmd); + + return; +} + +static void sstore_stage_global_process_cmd(int fd, + short event, + void *cbdata) +{ + int ret; + orte_message_event_t *mev = (orte_message_event_t*)cbdata; + orte_process_name_t *sender = NULL; + orte_sstore_stage_cmd_flag_t command; + orte_std_cntr_t count; + orte_sstore_base_handle_t loc_id; + orte_sstore_stage_global_snapshot_info_t *handle_info = NULL; + + sender = &(mev->sender); + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): process_cmd(%s)", + ORTE_NAME_PRINT(sender))); + + /* + * If this was an application process contacting us, then act like an orted + * instead of an HNP + */ + if(OPAL_EQUAL != orte_util_compare_name_fields(ORTE_NS_CMP_JOBID, + ORTE_PROC_MY_NAME, + sender)) { + orte_sstore_stage_local_process_cmd(fd, event, cbdata); + return; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &command, &count, ORTE_SSTORE_STAGE_CMD))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &loc_id, &count, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + /* + * Find the referenced handle + */ + if(NULL == (handle_info = find_handle_info(loc_id)) ) { + ; /* JJH big problem */ + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): process_cmd(%s) - Command = %s", + ORTE_NAME_PRINT(sender), + (ORTE_SSTORE_STAGE_PULL == command ? "Pull" : + (ORTE_SSTORE_STAGE_PUSH == command ? "Push" : + (ORTE_SSTORE_STAGE_REMOVE == command ? "Remove" : + (ORTE_SSTORE_STAGE_DONE == command ? 
"Done" : "Unknown")))) )); + + /* + * Process the command + */ + if( ORTE_SSTORE_STAGE_PULL == command ) { + process_local_pull(sender, mev->buffer, handle_info); + } + else if( ORTE_SSTORE_STAGE_PUSH == command ) { + process_local_push(sender, mev->buffer, handle_info); + } + else if( ORTE_SSTORE_STAGE_REMOVE == command ) { + /* This is actually intended for the local coordinator */ + orte_sstore_stage_local_process_cmd_action(sender, command, loc_id, mev->buffer); + } + else if( ORTE_SSTORE_STAGE_DONE == command ) { + process_local_done(sender, mev->buffer, handle_info); + } + + cleanup: + return; +} + +static int process_local_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_global_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t loc_buffer; + orte_sstore_stage_cmd_flag_t command; + + /* + * Push back the requested information + */ + OBJ_CONSTRUCT(&loc_buffer, opal_buffer_t); + + command = ORTE_SSTORE_STAGE_PUSH; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &command, 1, ORTE_SSTORE_STAGE_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->seq_num), 1, OPAL_INT )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->ref_name), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->app_local_location_fmt), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( orte_sstore_stage_enabled_caching ) { + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->app_local_cache_location_fmt), 1, 
OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->migrating), 1, OPAL_BOOL )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(peer, &loc_buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&loc_buffer); + + return exit_status; +} + +static int process_local_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_global_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_std_cntr_t count; + size_t num_entries, i; + orte_process_name_t name; + bool ckpt_skipped = false; + char * crs_comp = NULL; + char * compress_comp = NULL; + char * compress_postfix = NULL; + char * proc_name = NULL; + char * tmp_str = NULL; + orte_filem_base_request_t *filem_request = NULL; + orte_filem_base_process_set_t *p_set = NULL; + orte_filem_base_file_set_t * f_set = NULL; + + if( !orte_sstore_stage_skip_filem ) { + filem_request = OBJ_NEW(orte_filem_base_request_t); + /* + * Define the process set: + * Source (daemon) -> Sink (HNP) + */ + p_set = OBJ_NEW(orte_filem_base_process_set_t); + p_set->source.jobid = peer->jobid; + p_set->source.vpid = peer->vpid; + p_set->sink.jobid = ORTE_PROC_MY_NAME->jobid; + p_set->sink.vpid = ORTE_PROC_MY_NAME->vpid; + opal_list_append(&(filem_request->process_sets), &(p_set->super) ); + } + + /* + * Unpack the data + */ + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &num_entries, &count, OPAL_SIZE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + for(i = 0; i < num_entries; ++i ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &name, &count, ORTE_NAME))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &ckpt_skipped,
&count, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( !ckpt_skipped ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &crs_comp, &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( orte_sstore_stage_enabled_compression ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &compress_comp, &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + if( NULL == handle_info->compress_comp ) { + handle_info->compress_comp = strdup(compress_comp); + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &compress_postfix, &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + if( NULL == handle_info->compress_postfix ) { + handle_info->compress_postfix = strdup(compress_postfix); + } + } + + if( !orte_sstore_stage_skip_filem ) { + /* + * Append to the file set for movement + */ + f_set = OBJ_NEW(orte_filem_base_file_set_t); + if( orte_sstore_stage_enabled_compression ) { + f_set->target_flag = ORTE_FILEM_TYPE_FILE; + } else { + f_set->target_flag = ORTE_FILEM_TYPE_DIR; + } + + if( orte_sstore_stage_enabled_compression ) { + asprintf(&tmp_str, + handle_info->app_global_location_fmt, + name.vpid); + asprintf(&(f_set->local_target), "%s%s", + tmp_str, + compress_postfix); + } else { + asprintf(&(f_set->local_target), + handle_info->app_global_location_fmt, + name.vpid); + } + + if( orte_sstore_stage_global_is_shared ) { + f_set->local_hint = ORTE_FILEM_HINT_SHARED; + } + + if( orte_sstore_stage_enabled_compression ) { + asprintf(&tmp_str, + handle_info->app_local_location_fmt, + name.vpid); + asprintf(&(f_set->remote_target), "%s%s", + tmp_str, + compress_postfix); + } else { + asprintf(&(f_set->remote_target), + handle_info->app_local_location_fmt, + name.vpid); + } + + opal_list_append(&(filem_request->file_sets), &(f_set->super) ); + + OPAL_OUTPUT_VERBOSE((10, 
mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): push(): Pulling remote file <%s> to <%s>", + f_set->remote_target, + f_set->local_target)); + } + + /* + * Write this information to the global metadata + */ + orte_util_convert_process_name_to_string(&proc_name, &name); + + metadata_write_str(handle_info, + SSTORE_METADATA_INTERNAL_PROCESS_STR, + proc_name); + metadata_write_str(handle_info, + SSTORE_METADATA_LOCAL_CRS_COMP_STR, + crs_comp); + if( orte_sstore_stage_enabled_compression ) { + metadata_write_str(handle_info, + SSTORE_METADATA_LOCAL_COMPRESS_COMP_STR, + compress_comp); + metadata_write_str(handle_info, + SSTORE_METADATA_LOCAL_COMPRESS_POSTFIX_STR, + compress_postfix); + } + } + + if( NULL != crs_comp ) { + free(crs_comp); + crs_comp = NULL; + } + if( NULL != compress_comp ) { + free(compress_comp); + compress_comp = NULL; + } + if( NULL != compress_postfix ) { + free(compress_postfix); + compress_postfix = NULL; + } + if( NULL != proc_name ) { + free(proc_name); + proc_name = NULL; + } + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + (handle_info->num_procs_synced)++; + } + + if( !orte_sstore_stage_skip_filem && 0 < opal_list_get_size(&(filem_request->file_sets)) ) { + /* + * Start to pull the files to global storage + */ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): push(): Pulling remote files from %s (%3d of %3d done)", + ORTE_NAME_PRINT(peer), + handle_info->num_procs_synced, + handle_info->num_procs_total)); + opal_list_append(handle_info->filem_requests, &(filem_request->super)); + if(ORTE_SUCCESS != (ret = orte_filem.get_nb(filem_request)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + cleanup: + if( NULL != crs_comp ) { + free(crs_comp); + crs_comp = NULL; + } + if( NULL != compress_comp ) { + free(compress_comp); + compress_comp = NULL; + } + if( NULL != compress_postfix ) { + free(compress_postfix); + 
compress_postfix = NULL; + } + if( NULL != proc_name ) { + free(proc_name); + proc_name = NULL; + } + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + return exit_status; +} + +static int process_local_done(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_global_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_std_cntr_t count; + size_t num_entries; + + /* + * Unpack the data + */ + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &num_entries, &count, OPAL_SIZE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + (handle_info->num_procs_done) += (int)num_entries; + + SSTORE_STAGE_REPORT_PROGRESS(handle_info); + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): done(): [Peer %s] Moved %d files (%3d of %3d reported as done)", + ORTE_NAME_PRINT(peer), + (int)num_entries, + handle_info->num_procs_done, + handle_info->num_procs_total)); + + cleanup: + return exit_status; +} + +static int init_global_snapshot_directory(orte_sstore_stage_global_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + char * dir_name = NULL; + mode_t my_mode = S_IRWXU; + + /* + * Make the snapshot directory from the uniq_global_snapshot_name + */ + asprintf(&dir_name, "%s/%s", + handle_info->base_location, + handle_info->local_location); + if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(dir_name, my_mode)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Open up the metadata file + */ + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + if(NULL != dir_name) { + free(dir_name); + dir_name = NULL; + } + + return exit_status; +} + +static int wait_all_filem(orte_sstore_stage_global_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_list_item_t* item = NULL; + + if( 
orte_sstore_stage_skip_filem ) { + return exit_status; + } + + /* + * Wait for all the transfers to complete + */ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): wait_all_filem(): Waiting on all outstanding FileM requests (%d)", + (int)opal_list_get_size(handle_info->filem_requests) )); + + if(ORTE_SUCCESS != (ret = orte_filem.wait_all(handle_info->filem_requests)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Remove the data on the remote side + */ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): wait_all_filem(): Removing all local files")); + if( ORTE_SUCCESS != (ret = xcast_remove_all(handle_info))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Touch all local checkpoints + */ + sync_global_dir(handle_info); + + /* + * Wait for the removal to complete + */ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): wait_all_filem(): Waiting for remove to finish...")); + while(handle_info->num_procs_done < handle_info->num_procs_total) { + opal_progress(); + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(global): wait_all_filem(): All files have been transferred")); + + cleanup: + while (NULL != (item = opal_list_remove_first(handle_info->filem_requests) ) ) { + OBJ_RELEASE(item); + } + OBJ_DESTRUCT(handle_info->filem_requests); + + return exit_status; +} + +static int xcast_remove_all(orte_sstore_stage_global_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t loc_buffer; + orte_sstore_stage_cmd_flag_t command; + + handle_info->num_procs_done = 0; + + OBJ_CONSTRUCT(&loc_buffer, opal_buffer_t); + + command = ORTE_SSTORE_STAGE_REMOVE; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &command, 1, ORTE_SSTORE_STAGE_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret;
+ goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( ORTE_SUCCESS != (ret = orte_grpcomm.xcast(ORTE_PROC_MY_NAME->jobid, &loc_buffer, ORTE_RML_TAG_SSTORE_INTERNAL)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&loc_buffer); + + return exit_status; +} + +/************************** + * Metadata functions + **************************/ +static int metadata_open(orte_sstore_stage_global_snapshot_info_t * handle_info) +{ + /* If already open, then just return */ + if( NULL != handle_info->metadata ) { + return ORTE_SUCCESS; + } + + if (NULL == (handle_info->metadata = fopen(handle_info->metadata_filename, "a")) ) { + opal_output(orte_sstore_base_output, + "sstore:stage:(global):init_dir() Unable to open the file (%s)\n", + handle_info->metadata_filename); + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + return ORTE_SUCCESS; +} + +static int metadata_close(orte_sstore_stage_global_snapshot_info_t * handle_info) +{ + /* If already closed, then just return */ + if( NULL == handle_info->metadata ) { + return ORTE_SUCCESS; + } + + fclose(handle_info->metadata); + handle_info->metadata = NULL; + + return ORTE_SUCCESS; +} + +static int metadata_write_int(orte_sstore_stage_global_snapshot_info_t * handle_info, char *key, int value) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + fprintf(handle_info->metadata, "%s%d\n", key, value); + + cleanup: + return exit_status; +} + +static int metadata_write_str(orte_sstore_stage_global_snapshot_info_t * handle_info, char *key, char *value) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* Make sure the metadata file is open */ 
+ if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + fprintf(handle_info->metadata, "%s%s\n", key, value); + + cleanup: + return exit_status; +} + +static int metadata_write_timestamp(orte_sstore_stage_global_snapshot_info_t * handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + time_t timestamp; + + /* Make sure the metadata file is open */ + if( NULL == handle_info->metadata ) { + if( ORTE_SUCCESS != (ret = metadata_open(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + timestamp = time(NULL); + fprintf(handle_info->metadata, "%s%s", + SSTORE_METADATA_INTERNAL_TIME_STR, + ctime(&timestamp)); + + cleanup: + return exit_status; +} + +static int orte_sstore_stage_extract_global_metadata(orte_sstore_stage_global_snapshot_info_t * handle_info, + orte_sstore_base_global_snapshot_info_t *global_snapshot) +{ + int exit_status = ORTE_SUCCESS; + orte_sstore_base_local_snapshot_info_t *vpid_snapshot = NULL; + opal_list_item_t* item = NULL; + int i = 0; + + /* + * Cleanup the structure a bit, so we can refresh it below + */ + while (NULL != (item = opal_list_remove_first(&global_snapshot->local_snapshots))) { + OBJ_RELEASE(item); + } + + if( NULL != global_snapshot->start_time ) { + free( global_snapshot->start_time ); + global_snapshot->start_time = NULL; + } + + if( NULL != global_snapshot->end_time ) { + free( global_snapshot->end_time ); + global_snapshot->end_time = NULL; + } + + /* + * Create a structure for each application process + */ + for(i = 0; i < handle_info->num_procs_total; ++i) { + vpid_snapshot = OBJ_NEW(orte_sstore_base_local_snapshot_info_t); + vpid_snapshot->ss_handle = handle_info->id; + + vpid_snapshot->process_name.jobid = handle_info->jobid; + vpid_snapshot->process_name.vpid = i; + + /* JJH: Currently we do not have this information since we do not save + * individual vpid info in the Global
SStore. It is in the metadata + * though. + */ + vpid_snapshot->crs_comp = NULL; + if( NULL != handle_info->compress_comp ) { + vpid_snapshot->compress_comp = strdup(handle_info->compress_comp); + } else { + vpid_snapshot->compress_comp = NULL; + } + if( NULL != handle_info->compress_postfix ) { + vpid_snapshot->compress_postfix = strdup(handle_info->compress_postfix); + } else { + vpid_snapshot->compress_postfix = NULL; + } + vpid_snapshot->start_time = NULL; + vpid_snapshot->end_time = NULL; + + opal_list_append(&global_snapshot->local_snapshots, &(vpid_snapshot->super)); + } + + return exit_status; +} + +static int stage_snapshot_sort_compare_fn(opal_list_item_t **a, + opal_list_item_t **b) +{ + orte_sstore_base_local_snapshot_info_t *snap_a, *snap_b; + + snap_a = (orte_sstore_base_local_snapshot_info_t*)(*a); + snap_b = (orte_sstore_base_local_snapshot_info_t*)(*b); + + if( snap_a->process_name.vpid > snap_b->process_name.vpid ) { + return 1; + } + else if( snap_a->process_name.vpid == snap_b->process_name.vpid ) { + return 0; + } + else { + return -1; + } +} + +static void sstore_stage_report_progress(orte_sstore_stage_global_snapshot_info_t *handle_info) +{ + double perc_done; + + perc_done = (handle_info->num_procs_total - handle_info->num_procs_done); + perc_done = perc_done / (1.0 * handle_info->num_procs_total); + perc_done = (perc_done-1)*(-100.0); + + if( perc_done >= (handle_info->last_progress_report + orte_sstore_stage_progress_meter ) || + handle_info->last_progress_report == 0.0 ) { + handle_info->last_progress_report = perc_done; + opal_output(0, "sstore:stage: progress: %10.2f %c Finished\n", + perc_done, '%'); + } + + return; +} diff --git a/orte/mca/sstore/stage/sstore_stage_local.c b/orte/mca/sstore/stage/sstore_stage_local.c new file mode 100644 index 0000000000..eb5c08233d --- /dev/null +++ b/orte/mca/sstore/stage/sstore_stage_local.c @@ -0,0 +1,2099 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. 
+ * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + * + */ + +#include "orte_config.h" + +#ifdef HAVE_STRING_H +#include <string.h> +#endif +#include +#include +#include +#include +#ifdef HAVE_UNISTD_H +#include <unistd.h> +#endif /* HAVE_UNISTD_H */ + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" + +#include "opal/event/event.h" + +#include "orte/constants.h" +#include "orte/util/show_help.h" +#include "opal/util/argv.h" +#include "opal/util/output.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" +#include "opal/util/os_dirpath.h" + +#include "opal/mca/compress/compress.h" +#include "opal/mca/compress/base/base.h" + +#include "opal/threads/mutex.h" +#include "opal/threads/condition.h" + +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/odls/odls_types.h" +#include "orte/mca/filem/filem.h" +#include "orte/mca/filem/base/base.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "sstore_stage.h" + +/********** + * Object stuff + **********/ +#define SSTORE_LOCAL_NONE 0 +#define SSTORE_LOCAL_ERROR 1 +#define SSTORE_LOCAL_INIT 2 +#define SSTORE_LOCAL_READY 3 +#define SSTORE_LOCAL_SYNCED 4 +#define SSTORE_LOCAL_DONE 5 + +struct orte_sstore_stage_local_snapshot_info_t { + /** List super object */ + opal_list_item_t super; + + /** */ + orte_sstore_base_handle_t id; + + /** Status */ + int status; + + /** Sequence Number */ + int seq_num; + + /** Global Reference Name */ + char * global_ref_name; + + /** Local Location Format String */ + char * location_fmt; + + /** Local Cache Location Format String */ + char * cache_location_fmt; + + /* Application info handles*/ + opal_list_t *app_info_handle; + + /** Compress Component used */ + char * compress_comp; 
+
+ /** Compress Component postfix */ + char * compress_postfix; + + /** Is this checkpoint representing a migration? */ + bool migrating; +}; +typedef struct orte_sstore_stage_local_snapshot_info_t orte_sstore_stage_local_snapshot_info_t; +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_stage_local_snapshot_info_t); + +void orte_sstore_stage_local_snapshot_info_construct(orte_sstore_stage_local_snapshot_info_t *info); +void orte_sstore_stage_local_snapshot_info_destruct( orte_sstore_stage_local_snapshot_info_t *info); + +OBJ_CLASS_INSTANCE(orte_sstore_stage_local_snapshot_info_t, + opal_list_item_t, + orte_sstore_stage_local_snapshot_info_construct, + orte_sstore_stage_local_snapshot_info_destruct); + +struct orte_sstore_stage_local_app_snapshot_info_t { + /** List super object */ + opal_list_item_t super; + + /** Process Name associated with this entry */ + orte_process_name_t name; + + /** Local Location (Absolute Path) */ + char * local_location; + + /** Compressed Local Location (Absolute Path) */ + char * compressed_local_location; + + /** Local Cache Location (Absolute Path) */ + char * local_cache_location; + + /** Metadata File Name (Absolute Path) */ + char * metadata_filename; + + /** CRS Component used */ + char * crs_comp; + + /** If this app. 
skipped the checkpoint - usually for non-migrating procs */ + bool ckpt_skipped; + + /** Compression PID to wait on */ + pid_t compress_pid; +}; +typedef struct orte_sstore_stage_local_app_snapshot_info_t orte_sstore_stage_local_app_snapshot_info_t; +ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_sstore_stage_local_app_snapshot_info_t); + +void orte_sstore_stage_local_app_snapshot_info_construct(orte_sstore_stage_local_app_snapshot_info_t *info); +void orte_sstore_stage_local_app_snapshot_info_destruct( orte_sstore_stage_local_app_snapshot_info_t *info); + +OBJ_CLASS_INSTANCE(orte_sstore_stage_local_app_snapshot_info_t, + opal_list_item_t, + orte_sstore_stage_local_app_snapshot_info_construct, + orte_sstore_stage_local_app_snapshot_info_destruct); + + + +/********** + * Local Function and Variable Declarations + **********/ +static bool is_global_listener_active = false; +static int sstore_stage_local_start_listener(void); +static int sstore_stage_local_stop_listener(void); +static void sstore_stage_local_recv(int status, + orte_process_name_t* sender, + opal_buffer_t* buffer, + orte_rml_tag_t tag, + void* cbdata); + +static int process_global_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info); +static int process_global_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info); +static int process_global_remove(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info); +static int process_app_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info); +static int process_app_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info); + +static orte_sstore_stage_local_snapshot_info_t *create_new_handle_info(orte_sstore_base_handle_t handle); +static orte_sstore_stage_local_snapshot_info_t 
*find_handle_info(orte_sstore_base_handle_t handle); +static orte_sstore_stage_local_snapshot_info_t *find_handle_info_ref(char * ref, int seq); + +static int append_new_app_handle_info(orte_sstore_stage_local_snapshot_info_t *handle_info, + orte_process_name_t *name); +static orte_sstore_stage_local_app_snapshot_info_t *find_app_handle_info(orte_sstore_stage_local_snapshot_info_t *handle_info, + orte_process_name_t *name); + +static int pull_handle_info(orte_sstore_stage_local_snapshot_info_t *handle_info ); +static int push_handle_info(orte_sstore_stage_local_snapshot_info_t *handle_info ); + +static int wait_all_apps_updated(orte_sstore_stage_local_snapshot_info_t *handle_info); + +static int start_compression(orte_sstore_stage_local_snapshot_info_t *handle_info, + orte_sstore_stage_local_app_snapshot_info_t *app_info); +static void sstore_stage_local_compress_waitpid_cb(pid_t pid, int status, void* cbdata); +static int wait_all_compressed(orte_sstore_stage_local_snapshot_info_t *handle_info); + +static int orte_sstore_stage_local_preload_files(char **local_location, bool *skip_xfer, + char *global_loc, char *ref, char *postfix, int seq); + +static int sstore_stage_create_local_dir(void); +static int sstore_stage_destroy_local_dir(void); + +static int sstore_stage_create_cache(void); +static int sstore_stage_update_cache(orte_sstore_stage_local_snapshot_info_t *handle_info); +static int sstore_stage_destroy_cache(void); + +static opal_list_t *active_handles = NULL; +static char * sstore_stage_local_basedir = NULL; + +static char * sstore_stage_cache_basedir = NULL; + +static char * sstore_stage_cache_current_dir = NULL; +static char * sstore_stage_cache_last_dir = NULL; + +static opal_list_t * preload_filem_requests = NULL; + +/********** + * Object stuff + **********/ +void orte_sstore_stage_local_snapshot_info_construct(orte_sstore_stage_local_snapshot_info_t *info) +{ + info->id = 0; + + info->status = SSTORE_LOCAL_NONE; + + info->seq_num = -1; + + 
info->global_ref_name = NULL; + + info->location_fmt = NULL; + + info->cache_location_fmt = NULL; + + info->app_info_handle = OBJ_NEW(opal_list_t); + + info->compress_comp = NULL; + + info->compress_postfix = NULL; + + info->migrating = false; +} + +void orte_sstore_stage_local_snapshot_info_destruct( orte_sstore_stage_local_snapshot_info_t *info) +{ + info->id = 0; + + info->status = SSTORE_LOCAL_NONE; + + info->seq_num = -1; + + if( NULL != info->global_ref_name ) { + free( info->global_ref_name ); + info->global_ref_name = NULL; + } + + if( NULL != info->location_fmt ) { + free( info->location_fmt ); + info->location_fmt = NULL; + } + + if( NULL != info->cache_location_fmt ) { + free( info->cache_location_fmt ); + info->cache_location_fmt = NULL; + } + + if( NULL != info->app_info_handle ) { + OBJ_RELEASE(info->app_info_handle); + info->app_info_handle = NULL; + } + + if( NULL != info->compress_comp ) { + free(info->compress_comp); + info->compress_comp = NULL; + } + + if( NULL != info->compress_postfix ) { + free(info->compress_postfix); + info->compress_postfix = NULL; + } + + info->migrating = false; +} + +void orte_sstore_stage_local_app_snapshot_info_construct(orte_sstore_stage_local_app_snapshot_info_t *info) +{ + info->name.jobid = ORTE_JOBID_INVALID; + info->name.vpid = ORTE_VPID_INVALID; + + info->local_location = NULL; + info->compressed_local_location = NULL; + info->local_cache_location = NULL; + info->metadata_filename = NULL; + info->crs_comp = NULL; + info->ckpt_skipped = false; + info->compress_pid = 0; +} + +void orte_sstore_stage_local_app_snapshot_info_destruct( orte_sstore_stage_local_app_snapshot_info_t *info) +{ + info->name.jobid = ORTE_JOBID_INVALID; + info->name.vpid = ORTE_VPID_INVALID; + + if( NULL != info->local_location ) { + free(info->local_location); + info->local_location = NULL; + } + + if( NULL != info->compressed_local_location ) { + free(info->compressed_local_location); + info->compressed_local_location = NULL; + } + + if( 
NULL != info->local_cache_location ) { + free(info->local_cache_location); + info->local_cache_location = NULL; + } + + if( NULL != info->metadata_filename ) { + free(info->metadata_filename); + info->metadata_filename = NULL; + } + + if( NULL != info->crs_comp ) { + free(info->crs_comp); + info->crs_comp = NULL; + } + + info->ckpt_skipped = false; + + info->compress_pid = 0; +} + +/****************** + * Local functions + ******************/ +int orte_sstore_stage_local_module_init(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): init()")); + + if( NULL == active_handles ) { + active_handles = OBJ_NEW(opal_list_t); + } + + if( NULL == preload_filem_requests ) { + preload_filem_requests = OBJ_NEW(opal_list_t); + } + + /* + * Create the local storage directory + */ + asprintf(&sstore_stage_local_basedir, "%s/%s/%s", + orte_sstore_stage_local_snapshot_dir, + ORTE_SSTORE_LOCAL_SNAPSHOT_DIR_NAME, + ORTE_SSTORE_LOCAL_SNAPSHOT_STAGE_DIR_NAME); + if( ORTE_SUCCESS != (ret = sstore_stage_create_local_dir()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Create the local cache + */ + if( orte_sstore_stage_enabled_caching ) { + asprintf(&sstore_stage_cache_basedir, "%s/%s/%s", + orte_sstore_stage_local_snapshot_dir, + ORTE_SSTORE_LOCAL_SNAPSHOT_DIR_NAME, + ORTE_SSTORE_LOCAL_SNAPSHOT_CACHE_DIR_NAME); + + if( ORTE_SUCCESS != (ret = sstore_stage_create_cache()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + /* + * Setup a listener for the HNP/Apps + * We could be the HNP, in which case the listener is already registered. 
+ */ + if( !ORTE_PROC_IS_HNP ) { + if( ORTE_SUCCESS != (ret = sstore_stage_local_start_listener()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + cleanup: + return exit_status; +} + +int orte_sstore_stage_local_module_finalize(void) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_local_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + bool done = false; + int cur_time = 0, max_time = 120; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): finalize()")); + + /* + * Wait for all active transfers to finish + */ + if( !ORTE_PROC_IS_HNP ) { + done = false; + while( 0 < opal_list_get_size(active_handles) && !done ) { + done = true; + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_stage_local_snapshot_info_t*)item; + if( SSTORE_LOCAL_DONE != handle_info->status && + SSTORE_LOCAL_NONE != handle_info->status && + SSTORE_LOCAL_ERROR != handle_info->status ) { + done = false; + break; + } + } + if( done ) { + break; + } + else { + if( cur_time != 0 && cur_time % 30 == 0 ) { + opal_output(0, "---> Waiting for fin(): %3d / %3d\n", + cur_time, max_time); + } + + opal_progress(); + if( cur_time >= max_time ) { + break; + } else { + sleep(1); + } + cur_time++; + } + } + } + + if( NULL != active_handles ) { + OBJ_RELEASE(active_handles); + } + + if( NULL != preload_filem_requests ) { + OBJ_RELEASE(preload_filem_requests); + } + + /* + * Shutdown the listener for the HNP/Apps + * We could be the HNP, in which case the listener is already deregistered. 
+ */ + if( !ORTE_PROC_IS_HNP ) { + if( ORTE_SUCCESS != (ret = sstore_stage_local_stop_listener()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + /* + * Destroy the local cache + */ + if( orte_sstore_stage_enabled_caching ) { + if( ORTE_SUCCESS != (ret = sstore_stage_destroy_cache()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + /* + * Destroy the local storage directory + */ + if( ORTE_SUCCESS != (ret = sstore_stage_destroy_local_dir()) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + if( orte_sstore_stage_enabled_caching ) { + if( NULL != sstore_stage_cache_basedir ) { + free(sstore_stage_cache_basedir); + sstore_stage_cache_basedir = NULL; + } + } + + if( NULL != sstore_stage_local_basedir ) { + free(sstore_stage_local_basedir); + sstore_stage_local_basedir = NULL; + } + + return exit_status; +} + +int orte_sstore_stage_local_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid) +{ + opal_output(0, "sstore:stage:(local): request_checkpoint_handle() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_stage_local_register(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_local_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): register()")); + + /* + * Create a handle + */ + if( NULL == (handle_info = find_handle_info(handle)) ) { + handle_info = create_new_handle_info(handle); + } + + /* + * Get basic information from Global SStore + */ + if( ORTE_SUCCESS != (ret = pull_handle_info(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Wait here until the pull request has been satisfied + */ + while(SSTORE_LOCAL_READY != handle_info->status && + SSTORE_LOCAL_ERROR != handle_info->status ) { + opal_progress(); + } + + cleanup: + return 
exit_status; +} + +int orte_sstore_stage_local_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value) +{ + opal_output(0, "sstore:stage:(local): get_attr() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_stage_local_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value) +{ + opal_output(0, "sstore:stage:(local): set_attr() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_stage_local_sync(orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_local_snapshot_info_t *handle_info = NULL; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): sync()")); + + /* + * Lookup the handle + */ + handle_info = find_handle_info(handle); + + /* + * Wait for all of the applications to update their metadata + */ + if( ORTE_SUCCESS != (ret = wait_all_apps_updated(handle_info))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Wait for compression to finish + */ + if( orte_sstore_stage_enabled_compression ) { + if( ORTE_SUCCESS != (ret = wait_all_compressed(handle_info))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + /* + * Push information to the Global coordinator + */ + if( ORTE_SUCCESS != (ret = push_handle_info(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + handle_info->status = SSTORE_LOCAL_SYNCED; + + cleanup: + return exit_status; +} + +int orte_sstore_stage_local_remove(orte_sstore_base_handle_t handle) +{ + opal_output(0, "sstore:stage:(local): remove() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +int orte_sstore_stage_local_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle) +{ + int ret, exit_status = ORTE_SUCCESS; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): pack()")); 
+ + /* + * Lookup the handle + */ + + + /* + * Pack the handle ID + */ + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &handle, 1, ORTE_SSTORE_HANDLE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Pack any metadata + */ + + cleanup: + return exit_status; +} + +int orte_sstore_stage_local_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_local_snapshot_info_t *handle_info = NULL; + orte_std_cntr_t count; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): unpack()")); + + /* + * Unpack the handle id + */ + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, handle, &count, ORTE_SSTORE_HANDLE))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Lookup the handle + */ + if( NULL == (handle_info = find_handle_info(*handle)) ) { + handle_info = create_new_handle_info(*handle); + } + + /* + * Unpack the metadata piggybacked on this message + */ + if( ORTE_SUCCESS != (ret = process_global_push(peer, buffer, handle_info))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +int orte_sstore_stage_local_fetch_app_deps(orte_app_context_t *app) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_local_snapshot_info_t *handle_info = NULL; + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t *item; + char **sstore_args = NULL; + char * req_snap_loc = NULL; + char * req_snap_global_ref = NULL; + char * req_snap_ref = NULL; + char * req_snap_postfix = NULL; + char * local_location = NULL; + char * req_snap_compress = NULL; + char * compress_local_location = NULL; + char * compress_ref = NULL; + char * tmp_str = NULL; + int req_snap_seq = 0; + orte_odls_child_t *child = NULL; + int loc_argc = 0; + bool skip_xfer = false; + + if( !app->used_on_node || NULL == 
app->sstore_load ) { + OPAL_OUTPUT_VERBOSE((30, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): fetch_app_deps(%3d): Not for this daemon (%s, %d, %s)", + app->idx, + (app->used_on_node ? "T" : "F"), + (int)app->num_procs, + app->sstore_load)); + /* Nothing to do */ + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): fetch_app_deps(%3d): %s", + app->idx, app->sstore_load)); + + /* + * Extract the 'ref:seq' parameter + */ + sstore_args = opal_argv_split(app->sstore_load, ':'); + req_snap_loc = strdup(sstore_args[0]); + req_snap_global_ref = strdup(sstore_args[1]); + req_snap_ref = strdup(sstore_args[2]); + if( NULL == sstore_args[4] ) { /* Not compressed */ + req_snap_seq = atoi( sstore_args[3]); + } else { + req_snap_compress = strdup(sstore_args[3]); + req_snap_postfix = strdup(sstore_args[4]); + req_snap_seq = atoi( sstore_args[5]); + } + + handle_info = find_handle_info_ref(req_snap_global_ref, req_snap_seq); + if( NULL == handle_info ) { + /* No checkpoints known, just preload the checkpoint */ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): fetch_app_deps(%3d): No known checkpoint [%s, %d]", + app->idx, + req_snap_ref, + req_snap_seq)); + goto filem_preload; + } + + /* + * If caching enabled, then look to see if we have this snapshot cached + * Do not cache if migrating, since checkpoints taken while migrating are + * not guaranteed to be globally taken. 
+ */ + if( orte_sstore_stage_enabled_caching && !handle_info->migrating ) { + /* + * Find the process + */ + for (item = opal_list_get_first(&orte_local_children); + item != opal_list_get_end(&orte_local_children); + item = opal_list_get_next(item)) { + child = (orte_odls_child_t*)item; + if( app->idx == child->app_idx ) { + /* + * Find the app snapshot ref + */ + app_info = find_app_handle_info(handle_info, child->name); + break; + } + } + + if( NULL == app_info ) { + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): fetch_app_deps(%3d): No processes known for this app context", + app->idx)); + goto filem_preload; + } + + /* + * Do we have a cached version of this file? + */ + if( NULL != app_info->local_cache_location && + 0 == (ret = access(app_info->local_cache_location, F_OK)) ) { + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): fetch_app_deps(%3d): Using local cache. (%s)", + app->idx, + app_info->local_cache_location)); + + opal_argv_append(&loc_argc, &(app->argv), "-c"); + opal_argv_append(&loc_argc, &(app->argv), app_info->local_cache_location); + goto cleanup; + } else { + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): fetch_app_deps(%3d): No cache available for %s. (%s)", + app->idx, + ORTE_NAME_PRINT(&app_info->name), + app_info->local_cache_location)); + } + } + + filem_preload: + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): fetch_app_deps(%3d): Fetch files from Central storage", + app->idx)); + + /* + * If we got here, then there is no cached directory, so just preload the + * files, update the argument set, and carry on. 
+ */ + if( ORTE_SUCCESS != (ret = orte_sstore_stage_local_preload_files(&local_location, + &skip_xfer, + req_snap_loc, + req_snap_ref, + req_snap_postfix, + req_snap_seq)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + opal_argv_append(&loc_argc, &(app->argv), "-l"); + opal_argv_append(&loc_argc, &(app->argv), local_location); + + /* + * Decompress files: + * opal-restart will do this for us on launch + */ + if( !skip_xfer ) { + if( NULL != req_snap_compress && 0 < strlen(req_snap_compress) ) { + opal_argv_append(&loc_argc, &(app->argv), "-d"); + opal_argv_append(&loc_argc, &(app->argv), req_snap_compress); + } + if( NULL != req_snap_postfix && 0 < strlen(req_snap_postfix) ) { + opal_argv_append(&loc_argc, &(app->argv), "-p"); + opal_argv_append(&loc_argc, &(app->argv), req_snap_postfix); + } + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): fetch_app_deps(%3d): Fetching to (%s)", + app->idx, + local_location)); + + cleanup: + if( NULL != req_snap_compress ) { + free(req_snap_compress); + req_snap_compress = NULL; + } + + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + if( NULL != compress_local_location ) { + free(compress_local_location); + compress_local_location = NULL; + } + + if( NULL != compress_ref ) { + free(compress_ref); + compress_ref = NULL; + } + + if( NULL != sstore_args ) { + opal_argv_free(sstore_args); + sstore_args = NULL; + } + + if( NULL != req_snap_ref ) { + free(req_snap_ref); + req_snap_ref = NULL; + } + + if( NULL != req_snap_postfix ) { + free(req_snap_postfix); + req_snap_postfix = NULL; + } + + if( NULL != req_snap_loc ) { + free(req_snap_loc); + req_snap_loc = NULL; + } + + if( NULL != req_snap_global_ref ) { + free(req_snap_global_ref); + req_snap_global_ref = NULL; + } + + return exit_status; +} + +int orte_sstore_stage_local_wait_all_deps(void) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_list_item_t* item = NULL; + + /* Nothing 
being preloaded, so just move on */ + if( 0 >= opal_list_get_size(preload_filem_requests) ) { + return ORTE_SUCCESS; + } + + /* + * Wait for all files to move + */ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): wait_all_deps(): Waiting on %d requests", + (int)opal_list_get_size(preload_filem_requests))); + + if(ORTE_SUCCESS != (ret = orte_filem.wait_all(preload_filem_requests)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * Cache the restart files locally, so we can restart faster next time + * JJH: We already check the restart directory for a local copy before + * starting the transfer. So this feels unnecessary since the + * restart directory is always used as a cache, whether or not + * caching is enabled. The extra copy to the cache directory + * does not buy us anything. + */ + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): wait_all_deps(): Finished waiting on %d requests!", + (int)opal_list_get_size(preload_filem_requests))); + + cleanup: + while (NULL != (item = opal_list_remove_first(preload_filem_requests) ) ) { + OBJ_RELEASE(item); + } + + return exit_status; +} + +/************************** + * Local functions + **************************/ +static orte_sstore_stage_local_snapshot_info_t *create_new_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_stage_local_snapshot_info_t *handle_info = NULL; + opal_list_item_t *item = NULL; + orte_odls_child_t *child = NULL; + + if( NULL == active_handles ) { + active_handles = OBJ_NEW(opal_list_t); + } + + handle_info = OBJ_NEW(orte_sstore_stage_local_snapshot_info_t); + + handle_info->id = handle; + + opal_list_append(active_handles, &(handle_info->super)); + + /* + * Create a sub structure for each child + */ + for (item = opal_list_get_first(&orte_local_children); + item != opal_list_get_end(&orte_local_children); + item = opal_list_get_next(item)) { + child = 
(orte_odls_child_t*)item; + append_new_app_handle_info(handle_info, child->name); + } + + handle_info->status = SSTORE_LOCAL_INIT; + + return handle_info; +} + +static orte_sstore_stage_local_snapshot_info_t *find_handle_info(orte_sstore_base_handle_t handle) +{ + orte_sstore_stage_local_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + + if( NULL == active_handles ) { + return NULL; + } + + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_stage_local_snapshot_info_t*)item; + + if( handle_info->id == handle ) { + return handle_info; + } + } + + return NULL; +} + +static orte_sstore_stage_local_snapshot_info_t *find_handle_info_ref(char * ref, int seq) +{ + orte_sstore_stage_local_snapshot_info_t *handle_info = NULL; + opal_list_item_t* item = NULL; + + if( NULL == active_handles ) { + return NULL; + } + + for(item = opal_list_get_first(active_handles); + item != opal_list_get_end(active_handles); + item = opal_list_get_next(item) ) { + handle_info = (orte_sstore_stage_local_snapshot_info_t*)item; + + if( 0 == strncmp(handle_info->global_ref_name, ref, strlen(ref)) && + handle_info->seq_num == seq ) { + return handle_info; + } + } + + return NULL; +} + +static int append_new_app_handle_info(orte_sstore_stage_local_snapshot_info_t *handle_info, + orte_process_name_t *name) +{ + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + + app_info = OBJ_NEW(orte_sstore_stage_local_app_snapshot_info_t); + + app_info->name.jobid = name->jobid; + app_info->name.vpid = name->vpid; + + opal_list_append(handle_info->app_info_handle, &(app_info->super)); + + return ORTE_SUCCESS; +} + +static orte_sstore_stage_local_app_snapshot_info_t *find_app_handle_info(orte_sstore_stage_local_snapshot_info_t *handle_info, + orte_process_name_t *name) +{ + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t* item = NULL; + + 
for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)item; + + if( app_info->name.jobid == name->jobid && + app_info->name.vpid == name->vpid ) { + return app_info; + } + } + + return NULL; +} + +static int sstore_stage_local_start_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if( is_global_listener_active ) { + return ORTE_SUCCESS; + } + + if (ORTE_SUCCESS != (ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_SSTORE_INTERNAL, + ORTE_RML_PERSISTENT, + sstore_stage_local_recv, + NULL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + is_global_listener_active = true; + + cleanup: + return exit_status; +} + +static int sstore_stage_local_stop_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if (ORTE_SUCCESS != (ret = orte_rml.recv_cancel(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_SSTORE_INTERNAL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + is_global_listener_active = false; + + cleanup: + return exit_status; +} + +static void sstore_stage_local_recv(int status, + orte_process_name_t* sender, + opal_buffer_t* buffer, + orte_rml_tag_t tag, + void* cbdata) +{ + if( ORTE_RML_TAG_SSTORE_INTERNAL != tag ) { + return; + } + + ORTE_MESSAGE_EVENT(sender, buffer, tag, orte_sstore_stage_local_process_cmd); + + return; +} + +void orte_sstore_stage_local_process_cmd(int fd, + short event, + void *cbdata) +{ + int ret; + orte_message_event_t *mev = (orte_message_event_t*)cbdata; + orte_process_name_t *sender = NULL; + orte_sstore_stage_cmd_flag_t command; + orte_std_cntr_t count; + orte_sstore_base_handle_t loc_id; + + sender = &(mev->sender); + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): process_cmd(%s)", + ORTE_NAME_PRINT(sender))); + + count = 1; + if (ORTE_SUCCESS != (ret = 
opal_dss.unpack(mev->buffer, &command, &count, ORTE_SSTORE_STAGE_CMD))) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &loc_id, &count, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + goto cleanup; + } + + orte_sstore_stage_local_process_cmd_action(sender, command, loc_id, mev->buffer); + + cleanup: + return; +} + +int orte_sstore_stage_local_process_cmd_action(orte_process_name_t *sender, + orte_sstore_stage_cmd_flag_t command, + orte_sstore_base_handle_t loc_id, + opal_buffer_t* buffer) +{ + orte_sstore_stage_local_snapshot_info_t *handle_info = NULL; + + /* + * Find the referenced handle (Create if it does not exist) + */ + if(NULL == (handle_info = find_handle_info(loc_id)) ) { + handle_info = create_new_handle_info(loc_id); + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): process_cmd(%s) - Command = %s", + ORTE_NAME_PRINT(sender), + (ORTE_SSTORE_STAGE_PULL == command ? "Pull" : + (ORTE_SSTORE_STAGE_PUSH == command ? "Push" : + (ORTE_SSTORE_STAGE_REMOVE == command ? "Remove" : + (ORTE_SSTORE_STAGE_DONE == command ? 
"Done" : "Unknown")))) )); + + /* + * Process the command + */ + if( ORTE_SSTORE_STAGE_PULL == command ) { + if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_PROC_MY_HNP, sender)) { + process_global_pull(sender, buffer, handle_info); + } else { + process_app_pull(sender, buffer, handle_info); + } + } + else if( ORTE_SSTORE_STAGE_PUSH == command ) { + if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_PROC_MY_HNP, sender)) { + process_global_push(sender, buffer, handle_info); + } else { + process_app_push(sender, buffer, handle_info); + } + } + else if( ORTE_SSTORE_STAGE_REMOVE == command ) { + /* The xcast from the root makes the 'sender' equal to this process :/ + * so we know it is the HNP, so just use that name */ + process_global_remove(ORTE_PROC_MY_HNP, buffer, handle_info); + } + + return ORTE_SUCCESS; +} + +static int process_global_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info) +{ + /* JJH should be as simple as calling push_handle_info() */ + opal_output(0, "sstore:stage:(local): process_global_pull() Not implemented!"); + return ORTE_ERR_NOT_IMPLEMENTED; +} + +static int process_global_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_std_cntr_t count; + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t* item = NULL; + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(handle_info->seq_num), &count, OPAL_INT))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(handle_info->global_ref_name), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(handle_info->location_fmt), &count, OPAL_STRING))) { + 
ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( orte_sstore_stage_enabled_caching ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(handle_info->cache_location_fmt), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(handle_info->migrating), &count, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * For each process we are working with + */ + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)item; + + if( NULL != app_info->local_location ) { + free(app_info->local_location); + app_info->local_location = NULL; + } + asprintf(&(app_info->local_location), handle_info->location_fmt, app_info->name.vpid); + + if( orte_sstore_stage_enabled_caching ) { + if( NULL != app_info->local_cache_location ) { + free(app_info->local_cache_location); + app_info->local_cache_location = NULL; + } + asprintf(&(app_info->local_cache_location), handle_info->cache_location_fmt, app_info->name.vpid); + } + + if( NULL != app_info->metadata_filename ) { + free(app_info->metadata_filename); + app_info->metadata_filename = NULL; + } + asprintf(&(app_info->metadata_filename), "%s/%s", + app_info->local_location, + orte_sstore_base_local_metadata_filename); + } + + cleanup: + if( ORTE_SUCCESS == exit_status ) { + handle_info->status = SSTORE_LOCAL_READY; + } else { + handle_info->status = SSTORE_LOCAL_ERROR; + } + + return exit_status; +} + +static int process_global_remove(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t* item = NULL; + opal_buffer_t loc_buffer; 
+ orte_sstore_stage_cmd_flag_t command; + size_t list_size; + char * cmd = NULL; + + OBJ_CONSTRUCT(&loc_buffer, opal_buffer_t); + + /* + * If not caching, or if migrating, just remove the local copy. + * Checkpoints generated while migrating are never cached, so there + * is nothing to keep. + */ + if( !orte_sstore_stage_enabled_caching || handle_info->migrating ) { + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)item; + if( NULL != cmd ) { free(cmd); cmd = NULL; } + asprintf(&cmd, "rm -rf %s", app_info->local_location); + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): remove(): Removing with command (%s)", + cmd)); + system(cmd); + + if( orte_sstore_stage_enabled_compression && NULL != app_info->compressed_local_location) { + free(cmd); + cmd = NULL; + + asprintf(&cmd, "rm -rf %s", app_info->compressed_local_location); + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): remove(): Removing with command (%s)", + cmd)); + system(cmd); + } + } + } + else { + /* + * Update the local cache + */ + if( ORTE_SUCCESS != (ret = sstore_stage_update_cache(handle_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + command = ORTE_SSTORE_STAGE_DONE; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &command, 1, ORTE_SSTORE_STAGE_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + list_size = opal_list_get_size(handle_info->app_info_handle); + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &list_size, 1, OPAL_SIZE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(peer,
&loc_buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): remove(): Sent done for %d files to %s", + (int)list_size, + ORTE_NAME_PRINT(peer))); + + handle_info->status = SSTORE_LOCAL_DONE; + + cleanup: + if( NULL != cmd ) { + free(cmd); + cmd = NULL; + } + + OBJ_DESTRUCT(&loc_buffer); + + return exit_status; +} + +static int process_app_pull(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t loc_buffer; + orte_sstore_stage_cmd_flag_t command; + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + + /* + * Find this app's data + */ + app_info = find_app_handle_info(handle_info, peer); + + /* + * Push back the requested information + */ + OBJ_CONSTRUCT(&loc_buffer, opal_buffer_t); + + command = ORTE_SSTORE_STAGE_PUSH; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &command, 1, ORTE_SSTORE_STAGE_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->seq_num), 1, OPAL_INT )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(handle_info->global_ref_name), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(app_info->local_location), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&loc_buffer, &(app_info->metadata_filename), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + 
exit_status = ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(peer, &loc_buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&loc_buffer); + + return exit_status; +} + +static int process_app_push(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_stage_local_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_std_cntr_t count; + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + + /* + * Find this app's data + */ + app_info = find_app_handle_info(handle_info, peer); + + /* + * Unpack the data + */ + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(app_info->ckpt_skipped), &count, OPAL_BOOL))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( !app_info->ckpt_skipped ) { + count = 1; + if (ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &(app_info->crs_comp), &count, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): app_push(%s, skip=%s, %s)", + ORTE_NAME_PRINT(&(app_info->name)), + (app_info->ckpt_skipped ? 
"T" : "F"), + app_info->crs_comp)); + + /* Compression started on sync() */ + + cleanup: + return exit_status; +} + +static int wait_all_apps_updated(orte_sstore_stage_local_snapshot_info_t *handle_info) +{ + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t *item = NULL; + bool is_done = true; + + do { + is_done = true; + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)item; + + if( NULL == app_info->crs_comp && !app_info->ckpt_skipped ) { + is_done = false; + break; + } + } + + if( !is_done ) { + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): Waiting for application %s", + ORTE_NAME_PRINT(&(app_info->name)) )); + opal_progress(); + } + } while(!is_done); + + return ORTE_SUCCESS; +} + +static int start_compression(orte_sstore_stage_local_snapshot_info_t *handle_info, + orte_sstore_stage_local_app_snapshot_info_t *app_info) +{ + int ret, exit_status = ORTE_SUCCESS; + char * postfix = NULL; + + /* Sanity Check */ + if( !orte_sstore_stage_enabled_compression ) { + goto cleanup; + } + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): start_compression() Starting compression for process %s of (%s)", + ORTE_NAME_PRINT(&(app_info->name)), + app_info->local_location )); + + /* + * Start compression (nonblocking) + */ + if( ORTE_SUCCESS != (ret = opal_compress.compress_nb(app_info->local_location, + &(app_info->compressed_local_location), + &(postfix), + &(app_info->compress_pid))) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + if( app_info->compress_pid <= 0 ) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + if( NULL == handle_info->compress_comp ) { + handle_info->compress_comp =
strdup(opal_compress_base_selected_component.base_version.mca_component_name); + handle_info->compress_postfix = strdup(postfix); + } + + /* + * Set up a callback for when it is finished + */ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): start_compression() Waiting for compression (%d) for process %s", + app_info->compress_pid, + ORTE_NAME_PRINT(&(app_info->name)) )); + + if( ORTE_SUCCESS != (ret = orte_wait_cb(app_info->compress_pid, sstore_stage_local_compress_waitpid_cb, app_info) ) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + if( NULL != postfix ) { + free(postfix); + postfix = NULL; + } + + return exit_status; +} + +static void sstore_stage_local_compress_waitpid_cb(pid_t pid, int status, void* cbdata) +{ + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)cbdata; + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): waitpid(%6d) Compression finished for Process %s", + (int)pid, + ORTE_NAME_PRINT(&(app_info->name)) )); + + app_info->compress_pid = 0; +} + +static int wait_all_compressed(orte_sstore_stage_local_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t *item = NULL; + bool is_done = true; + int usleep_time = 1000; + int s_time = 0, max_wait_time; + + /* Sanity Check */ + if( !orte_sstore_stage_enabled_compression ) { + return ORTE_SUCCESS; + } + + /* + * Optionally delay, then start compression for all processes + */ + if( orte_sstore_stage_compress_delay > 0 ) { + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): Delaying %d second(s) before starting compression...", + orte_sstore_stage_compress_delay)); + max_wait_time = orte_sstore_stage_compress_delay * (1000000/usleep_time); + for( s_time = 0; s_time < max_wait_time; ++s_time) {
+ opal_progress(); + usleep(usleep_time); + } + } + + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)item; + + if( ORTE_SUCCESS != (ret = start_compression(handle_info, app_info)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + + /* + * Wait for compression to finish + */ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): Waiting for compression to finish...")); + do { + is_done = true; + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)item; + + if( 0 < app_info->compress_pid ) { + is_done = false; + break; + } + } + + if( !is_done ) { + OPAL_OUTPUT_VERBOSE((30, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): Waiting for compression to finish for application %s", + ORTE_NAME_PRINT(&(app_info->name)) )); + opal_progress(); + } + } while(!is_done); + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): Compression finished!")); + cleanup: + return exit_status; +} + +static int pull_handle_info(orte_sstore_stage_local_snapshot_info_t *handle_info ) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t buffer; + orte_sstore_stage_cmd_flag_t command; + + /* + * Check to see if this is necessary + * (Did we get all of the info from the handle unpack?)
+ */ + if( 0 <= handle_info->seq_num && + NULL != handle_info->global_ref_name && + NULL != handle_info->location_fmt ) { + handle_info->status = SSTORE_LOCAL_READY; + return ORTE_SUCCESS; + } + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + /* + * Ask the daemon to send us the info that we need + */ + command = ORTE_SSTORE_STAGE_PULL; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SSTORE_STAGE_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} + +static int push_handle_info(orte_sstore_stage_local_snapshot_info_t *handle_info ) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t buffer; + orte_sstore_stage_cmd_flag_t command; + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t *item = NULL; + size_t list_size; + + OBJ_CONSTRUCT(&buffer, opal_buffer_t); + + command = ORTE_SSTORE_STAGE_PUSH; + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &command, 1, ORTE_SSTORE_STAGE_CMD )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->id), 1, ORTE_SSTORE_HANDLE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + list_size = opal_list_get_size(handle_info->app_info_handle); + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &list_size, 1, OPAL_SIZE )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * For each process we are working with + */ + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); 
+ item = opal_list_get_next(item) ) { + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)item; + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(app_info->name), 1, ORTE_NAME )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(app_info->ckpt_skipped), 1, OPAL_BOOL )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( !app_info->ckpt_skipped ) { + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(app_info->crs_comp), 1, OPAL_STRING )) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + if( orte_sstore_stage_enabled_compression ) { + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->compress_comp), 1, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + if (ORTE_SUCCESS != (ret = opal_dss.pack(&buffer, &(handle_info->compress_postfix), 1, OPAL_STRING))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + } + } + } + + if (0 > (ret = orte_rml.send_buffer(ORTE_PROC_MY_HNP, &buffer, ORTE_RML_TAG_SSTORE_INTERNAL, 0))) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + OBJ_DESTRUCT(&buffer); + + return exit_status; +} + +static int sstore_stage_create_local_dir(void) +{ + int ret, exit_status = ORTE_SUCCESS; + mode_t my_mode = S_IRWXU; + + if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(sstore_stage_local_basedir, my_mode)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +static int sstore_stage_destroy_local_dir(void) +{ + int ret, exit_status = ORTE_SUCCESS; + char * basedir_root = NULL; + + asprintf(&basedir_root, "%s/%s", + orte_sstore_stage_local_snapshot_dir, + ORTE_SSTORE_LOCAL_SNAPSHOT_DIR_NAME); + + if(OPAL_SUCCESS != (ret = opal_os_dirpath_destroy(basedir_root, true, NULL)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + if( NULL != basedir_root 
) { + free(basedir_root); + basedir_root = NULL; + } + + return exit_status; +} + +static int sstore_stage_create_cache(void) +{ + int ret, exit_status = ORTE_SUCCESS; + mode_t my_mode = S_IRWXU; + + /* Sanity check */ + if( !orte_sstore_stage_enabled_caching ) { + goto cleanup; + } + + if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(sstore_stage_cache_basedir, my_mode)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +static int sstore_stage_destroy_cache(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + /* Sanity check */ + if( !orte_sstore_stage_enabled_caching ) { + goto cleanup; + } + + if(OPAL_SUCCESS != (ret = opal_os_dirpath_destroy(sstore_stage_cache_basedir, true, NULL)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +static int sstore_stage_update_cache(orte_sstore_stage_local_snapshot_info_t *handle_info) +{ + int ret, exit_status = ORTE_SUCCESS; + char *cmd = NULL; + mode_t my_mode = S_IRWXU; + char *cache_dirname = NULL; + orte_sstore_stage_local_app_snapshot_info_t *app_info = NULL; + opal_list_item_t* item = NULL; + size_t list_size; + + /* Sanity Check */ + if( !orte_sstore_stage_enabled_caching || handle_info->migrating) { + goto cleanup; + } + + list_size = opal_list_get_size(handle_info->app_info_handle); + if( 0 >= list_size ) { + /* No processes on this node, skip */ + exit_status = ORTE_SUCCESS; + goto cleanup; + } + + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)opal_list_get_first(handle_info->app_info_handle); + if( NULL == app_info ) { + ORTE_ERROR_LOG(ORTE_ERROR); + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* + * Create the base cache directory + */ + cache_dirname = opal_dirname(app_info->local_cache_location); + if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(cache_dirname, my_mode)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * For each process, move the 
current checkpoint to the cache directory + * Cached snapshots are always stored uncompressed. + */ + for(item = opal_list_get_first(handle_info->app_info_handle); + item != opal_list_get_end(handle_info->app_info_handle); + item = opal_list_get_next(item) ) { + app_info = (orte_sstore_stage_local_app_snapshot_info_t*)item; + if( NULL != cmd ) { free(cmd); cmd = NULL; } + asprintf(&cmd, "mv %s %s", app_info->local_location, cache_dirname); + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): update_cache(): Caching snapshot for process %s [%s]", + ORTE_NAME_PRINT(&app_info->name), + cmd)); + system(cmd); + + /* (JJH) Remove the compressed copy; the cache keeps only the uncompressed snapshot */ + if( orte_sstore_stage_enabled_compression && NULL != app_info->compressed_local_location) { + free(cmd); + cmd = NULL; + + asprintf(&cmd, "rm -rf %s", app_info->compressed_local_location); + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): update_cache(): Removing with command (%s)", + cmd)); + system(cmd); + } + } + if( NULL != cmd ) { free(cmd); cmd = NULL; } + /* + * Remove the previous cached checkpoint + */ + if( NULL != sstore_stage_cache_last_dir ) { + asprintf(&cmd, "rm -rf %s", sstore_stage_cache_last_dir); + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): update_cache(): Removing old cache dir with command (%s)", + cmd)); + system(cmd); + } + + /* + * Update 'last' cache pointer + */ + if( NULL != sstore_stage_cache_last_dir ) { + free(sstore_stage_cache_last_dir); + sstore_stage_cache_last_dir = NULL; + } + if( NULL != sstore_stage_cache_current_dir ) { + sstore_stage_cache_last_dir = strdup(sstore_stage_cache_current_dir); + } + + /* + * Update 'current' cache pointer + */ + if( NULL != sstore_stage_cache_current_dir ) { + free(sstore_stage_cache_current_dir); + sstore_stage_cache_current_dir = NULL; + } + sstore_stage_cache_current_dir = strdup(cache_dirname); + + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, +
"sstore:stage:(local): update_cache(): Cache Pointers cur(%s), last(%s)", + sstore_stage_cache_current_dir, sstore_stage_cache_last_dir)); + + cleanup: + if( NULL != cmd ) { + free(cmd); + cmd = NULL; + } + + return exit_status; +} + +static int orte_sstore_stage_local_preload_files(char **local_location, bool *skip_xfer, + char *global_loc, char *ref, char *postfix, int seq) +{ + int ret, exit_status = ORTE_SUCCESS; + mode_t my_mode = S_IRWXU; + orte_filem_base_request_t *filem_request; + orte_filem_base_process_set_t *p_set = NULL; + orte_filem_base_file_set_t * f_set = NULL; + char * full_local_location = NULL; + + *skip_xfer = false; + + if( NULL != *local_location) { + free(*local_location); + *local_location = NULL; + } + + /* + * If the global directory is shared, then just reference it directly. + * + * Skip this optimization if compressing, since decompressing on + * central storage would typically require a transfer to the local + * disk to decompress and then a transfer back, eliminating any + * benefit of the optimization. + */ + /* (JJH) If we are going to use the preloaded restart files for subsequent + * restarts then we actually always want to preload the files. This + * way if we need to restart from the same checkpoint again, then + * we can do so from the local restart cache. + */ +#if 0 + if( orte_sstore_stage_global_is_shared && + (NULL == postfix || 0 >= strlen(postfix) ) ) { + *local_location = strdup(global_loc); + *skip_xfer = true; + goto cleanup; + } +#endif + + asprintf(local_location, "%s/%s/%s/%d", + orte_sstore_stage_local_snapshot_dir, + ORTE_SSTORE_LOCAL_SNAPSHOT_DIR_NAME, + ORTE_SSTORE_LOCAL_SNAPSHOT_RESTART_DIR_NAME, + seq); + asprintf(&full_local_location, "%s/%s", + *local_location, + ref); + + /* + * If the snapshot already exists locally, just reuse it instead of + * transferring it again.
+ */ + if( 0 == (ret = access(full_local_location, F_OK)) ) { + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage:(local): preload_files() Local snapshot already exists, reuse it (%s)", + full_local_location)); + *skip_xfer = true; + goto cleanup; + } + + /* + * Create the local restart directory + */ + if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(*local_location, my_mode)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* + * FileM request to move the checkpoint to the local directory + */ + filem_request = OBJ_NEW(orte_filem_base_request_t); + + /* Define the process set */ + p_set = OBJ_NEW(orte_filem_base_process_set_t); + if( ORTE_PROC_IS_HNP ) { + /* if I am the HNP, then use me as the source */ + p_set->source.jobid = ORTE_PROC_MY_NAME->jobid; + p_set->source.vpid = ORTE_PROC_MY_NAME->vpid; + } + else { + /* otherwise, set the HNP as the source */ + p_set->source.jobid = ORTE_PROC_MY_HNP->jobid; + p_set->source.vpid = ORTE_PROC_MY_HNP->vpid; + } + p_set->sink.jobid = ORTE_PROC_MY_NAME->jobid; + p_set->sink.vpid = ORTE_PROC_MY_NAME->vpid; + opal_list_append(&(filem_request->process_sets), &(p_set->super) ); + + /* Define the file set */ + f_set = OBJ_NEW(orte_filem_base_file_set_t); + + f_set->local_target = strdup(*local_location); + if( NULL != postfix && 0 < strlen(postfix) ) { + asprintf(&(f_set->remote_target), "%s/%s%s", + global_loc, + ref, + postfix); + } else { + asprintf(&(f_set->remote_target), "%s/%s", + global_loc, + ref); + } + if( NULL != postfix && 0 < strlen(postfix) ) { + f_set->target_flag = ORTE_FILEM_TYPE_FILE; + } else { + f_set->target_flag = ORTE_FILEM_TYPE_DIR; + } + + opal_list_append(&(filem_request->file_sets), &(f_set->super) ); + + /* Start getting the files */ + opal_list_append(preload_filem_requests, &(filem_request->super)); + if(ORTE_SUCCESS != (ret = orte_filem.get_nb(filem_request)) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + 
cleanup: + if( NULL != full_local_location ) { + free(full_local_location); + full_local_location = NULL; + } + + return exit_status; +} diff --git a/orte/mca/sstore/stage/sstore_stage_module.c b/orte/mca/sstore/stage/sstore_stage_module.c new file mode 100644 index 0000000000..c8e2e579c0 --- /dev/null +++ b/orte/mca/sstore/stage/sstore_stage_module.c @@ -0,0 +1,373 @@ +/* + * Copyright (c) 2010 The Trustees of Indiana University. + * All rights reserved. + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/* + * + */ + +#include "orte_config.h" + +#ifdef HAVE_STRING_H +#include +#endif +#include +#include +#include +#include +#ifdef HAVE_UNISTD_H +#include +#endif /* HAVE_UNISTD_H */ + +#include "opal/mca/mca.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" + +#include "opal/event/event.h" + +#include "orte/constants.h" +#include "orte/util/show_help.h" +#include "opal/util/argv.h" +#include "opal/util/output.h" +#include "opal/util/opal_environ.h" +#include "opal/util/basename.h" + +#include "opal/threads/mutex.h" +#include "opal/threads/condition.h" + +#include "orte/util/name_fns.h" +#include "orte/util/proc_info.h" +#include "orte/runtime/orte_globals.h" +#include "orte/runtime/orte_wait.h" +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/snapc/snapc.h" + +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" + +#include "sstore_stage.h" + +/********** + * Local Function and Variable Declarations + **********/ + +/* + * stage module + */ +static orte_sstore_base_module_t loc_module = { + /** Initialization Function */ + orte_sstore_stage_module_init, + /** Finalization Function */ + orte_sstore_stage_module_finalize, + + orte_sstore_stage_request_checkpoint_handle, + orte_sstore_stage_request_restart_handle, + orte_sstore_stage_request_global_snapshot_data, + orte_sstore_stage_register, + orte_sstore_stage_get_attr, + 
orte_sstore_stage_set_attr, + orte_sstore_stage_sync, + orte_sstore_stage_remove, + + orte_sstore_stage_pack, + orte_sstore_stage_unpack, + orte_sstore_stage_fetch_app_deps, + orte_sstore_stage_wait_all_deps +}; + +/* + * MCA Functions + */ +int orte_sstore_stage_component_query(mca_base_module_t **module, int *priority) +{ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage: component_query()")); + + /* + * If the user failed to specify a directory, then skip this component + */ + if( NULL != orte_sstore_stage_local_snapshot_dir && + 0 < strlen(orte_sstore_stage_local_snapshot_dir) ) { + *priority = mca_sstore_stage_component.super.priority; + *module = (mca_base_module_t *)&loc_module; + } else { + *priority = -1; + *module = NULL; + } + + return ORTE_SUCCESS; +} + +int orte_sstore_stage_module_init(void) +{ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage: module_init()")); + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_module_init(); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_module_init(); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_module_init(); + } + + return ORTE_SUCCESS; +} + +int orte_sstore_stage_module_finalize(void) +{ + OPAL_OUTPUT_VERBOSE((10, mca_sstore_stage_component.super.output_handle, + "sstore:stage: module_finalize()")); + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_module_finalize(); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_module_finalize(); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_module_finalize(); + } + + return ORTE_SUCCESS; +} + +/****************** + * Local functions + ******************/ +int 
orte_sstore_stage_request_checkpoint_handle(orte_sstore_base_handle_t *handle, int seq, orte_jobid_t jobid) +{ + if( orte_sstore_context & ORTE_SSTORE_TOOL_TYPE ) { + opal_output(0, "sstore:stage:(tool): request_checkpoint_handle() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_request_checkpoint_handle(handle, seq, jobid); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_request_checkpoint_handle(handle, seq, jobid); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_request_checkpoint_handle(handle, seq, jobid); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_request_restart_handle(orte_sstore_base_handle_t *handle, char *basedir, char *ref, int seq, + orte_sstore_base_global_snapshot_info_t *snapshot) +{ + if( orte_sstore_context & ORTE_SSTORE_TOOL_TYPE ) { + return orte_sstore_base_tool_request_restart_handle(handle, basedir, ref, seq, snapshot); + } + else if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + opal_output(0, "sstore:stage:(global): request_restart_handle() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + opal_output(0, "sstore:stage:(local): request_restart_handle() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + opal_output(0, "sstore:stage:(app): request_restart_handle() Not supported!"); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_request_global_snapshot_data(orte_sstore_base_handle_t *handle, + orte_sstore_base_global_snapshot_info_t *snapshot) +{ + if( orte_sstore_context & ORTE_SSTORE_TOOL_TYPE ) { + opal_output(0, "sstore:stage:(tool): request_global_snapshot_data() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_request_global_snapshot_data(handle, snapshot); + } + else if( orte_sstore_context & 
ORTE_SSTORE_LOCAL_TYPE ) { + opal_output(0, "sstore:stage:(local): request_global_snapshot_data() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + opal_output(0, "sstore:stage:(app): request_global_snapshot_data() Not supported!"); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_register(orte_sstore_base_handle_t handle) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_register(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_register(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_register(handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_get_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char **value) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_TOOL_TYPE ) { + return orte_sstore_base_tool_get_attr(handle, key, value); + } + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_get_attr(handle, key, value); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_get_attr(handle, key, value); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_get_attr(handle, key, value); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_set_attr(orte_sstore_base_handle_t handle, orte_sstore_base_key_t key, char *value) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_set_attr(handle, key, value); + } + else if( orte_sstore_context & 
ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_set_attr(handle, key, value); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_set_attr(handle, key, value); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_sync(orte_sstore_base_handle_t handle) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_sync(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_sync(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_sync(handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_remove(orte_sstore_base_handle_t handle) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_remove(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_remove(handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_remove(handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_pack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t handle) +{ + if( ORTE_SSTORE_HANDLE_INVALID == handle ) { + ORTE_ERROR_LOG(ORTE_ERROR); + return ORTE_ERROR; + } + + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_pack(peer, buffer, handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_pack(peer, buffer, handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_pack(peer, buffer, handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int 
orte_sstore_stage_unpack(orte_process_name_t* peer, opal_buffer_t* buffer, orte_sstore_base_handle_t *handle) +{ + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + return orte_sstore_stage_global_unpack(peer, buffer, handle); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_unpack(peer, buffer, handle); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + return orte_sstore_stage_app_unpack(peer, buffer, handle); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_fetch_app_deps(orte_app_context_t *app) +{ + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + opal_output(0, "sstore:stage:(Global): fetch_app_deps() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_fetch_app_deps(app); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + opal_output(0, "sstore:stage:(App): fetch_app_deps() Not supported!"); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +int orte_sstore_stage_wait_all_deps(void) +{ + if( orte_sstore_context & ORTE_SSTORE_GLOBAL_TYPE ) { + opal_output(0, "sstore:stage:(Global): wait_all_deps() Not supported!"); + } + else if( orte_sstore_context & ORTE_SSTORE_LOCAL_TYPE ) { + return orte_sstore_stage_local_wait_all_deps(); + } + else if( orte_sstore_context & ORTE_SSTORE_APP_TYPE ) { + opal_output(0, "sstore:stage:(App): wait_all_deps() Not supported!"); + } + + return ORTE_ERR_NOT_SUPPORTED; +} + +/************************** + * Local functions + **************************/ diff --git a/orte/runtime/data_type_support/orte_dt_copy_fns.c b/orte/runtime/data_type_support/orte_dt_copy_fns.c index ea7ab3bd23..a892a573d9 100644 --- a/orte/runtime/data_type_support/orte_dt_copy_fns.c +++ b/orte/runtime/data_type_support/orte_dt_copy_fns.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and 
Indiana * University Research and Technology * Corporation. All rights reserved. * Copyright (c) 2004-2005 The University of Tennessee and The University @@ -192,6 +192,12 @@ int orte_dt_copy_app_context(orte_app_context_t **dest, orte_app_context_t *src, if( NULL != src->preload_files_src_dir) { (*dest)->preload_files_src_dir = strdup(src->preload_files_src_dir); } + +#if OPAL_ENABLE_FT_CR == 1 + if( NULL != src->sstore_load) { + (*dest)->sstore_load = strdup(src->sstore_load); + } +#endif return ORTE_SUCCESS; } diff --git a/orte/runtime/data_type_support/orte_dt_packing_fns.c b/orte/runtime/data_type_support/orte_dt_packing_fns.c index f1da6757dd..a49bdc89d5 100644 --- a/orte/runtime/data_type_support/orte_dt_packing_fns.c +++ b/orte/runtime/data_type_support/orte_dt_packing_fns.c @@ -741,7 +741,7 @@ int orte_dt_pack_app_context(opal_buffer_t *buffer, const void *src, return rc; } } - + /* pack the restart limits */ if (ORTE_SUCCESS != (rc = opal_dss_pack_buffer(buffer, (void*)(&(app_context[i]->max_local_restarts)), 1, OPAL_INT32))) { @@ -760,7 +760,30 @@ int orte_dt_pack_app_context(opal_buffer_t *buffer, const void *src, ORTE_ERROR_LOG(rc); return rc; } + + +#if OPAL_ENABLE_FT_CR == 1 + /* Pack the sstore_load string if we have one */ + if (NULL != app_context[i]->sstore_load) { + have_preload_files_dest_dir = 1; + } else { + have_preload_files_dest_dir = 0; + } + if (ORTE_SUCCESS != (rc = opal_dss_pack_buffer(buffer, + (void*)(&have_preload_files_dest_dir), 1, OPAL_INT8))) { + ORTE_ERROR_LOG(rc); + return rc; + } + + if( have_preload_files_dest_dir) { + if (ORTE_SUCCESS != (rc = opal_dss_pack_buffer(buffer, + (void*)(&(app_context[i]->sstore_load)), 1, OPAL_STRING))) { + ORTE_ERROR_LOG(rc); + return rc; + } + } +#endif } return ORTE_SUCCESS; diff --git a/orte/runtime/data_type_support/orte_dt_print_fns.c b/orte/runtime/data_type_support/orte_dt_print_fns.c index 714223db48..cfd473ee6b 100644 --- a/orte/runtime/data_type_support/orte_dt_print_fns.c +++
b/orte/runtime/data_type_support/orte_dt_print_fns.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. * Copyright (c) 2004-2005 The University of Tennessee and The University @@ -574,7 +574,14 @@ int orte_dt_print_app_context(char **output, char *prefix, orte_app_context_t *s pfx2, (NULL == src->preload_files_src_dir) ? "NULL" : src->preload_files_src_dir); free(tmp); tmp = tmp2; - + +#if OPAL_ENABLE_FT_CR == 1 + asprintf(&tmp2, "%s\n%s\tSStore Load: %s", tmp, + pfx2, (NULL == src->sstore_load) ? "NULL" : src->sstore_load); + free(tmp); + tmp = tmp2; +#endif + /* set the return */ *output = tmp; diff --git a/orte/runtime/data_type_support/orte_dt_unpacking_fns.c b/orte/runtime/data_type_support/orte_dt_unpacking_fns.c index 768b4e5de9..b191624c3a 100644 --- a/orte/runtime/data_type_support/orte_dt_unpacking_fns.c +++ b/orte/runtime/data_type_support/orte_dt_unpacking_fns.c @@ -816,7 +816,7 @@ int orte_dt_unpack_app_context(opal_buffer_t *buffer, void *dest, } else { app_context[i]->preload_files_src_dir = NULL; } - + /* unpack the restart limits */ max_n=1; if (ORTE_SUCCESS != (rc = opal_dss_unpack_buffer(buffer, &app_context[i]->max_local_restarts, @@ -830,7 +830,7 @@ int orte_dt_unpack_app_context(opal_buffer_t *buffer, void *dest, ORTE_ERROR_LOG(rc); return rc; } - + /* unpack the constrain flag */ max_n=1; if (ORTE_SUCCESS != (rc = opal_dss_unpack_buffer(buffer, &app_context[i]->constrain, @@ -838,6 +838,24 @@ int orte_dt_unpack_app_context(opal_buffer_t *buffer, void *dest, ORTE_ERROR_LOG(rc); return rc; } + +#if OPAL_ENABLE_FT_CR == 1 + /* Unpack the sstore_load */ + if (ORTE_SUCCESS != (rc = opal_dss_unpack_buffer(buffer, &have_preload_files_dest_dir, + &max_n, OPAL_INT8))) { + ORTE_ERROR_LOG(rc); + return rc; + } + if (have_preload_files_dest_dir) { + if (ORTE_SUCCESS != (rc =
opal_dss_unpack_buffer(buffer, &app_context[i]->sstore_load, + &max_n, OPAL_STRING))) { + ORTE_ERROR_LOG(rc); + return rc; + } + } else { + app_context[i]->sstore_load = NULL; + } +#endif } return ORTE_SUCCESS; diff --git a/orte/runtime/orte_cr.c b/orte/runtime/orte_cr.c index 7c56818408..45ec57f52a 100644 --- a/orte/runtime/orte_cr.c +++ b/orte/runtime/orte_cr.c @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. * Copyright (c) 2004-2005 The University of Tennessee and The University @@ -43,6 +43,7 @@ #include "opal/util/opal_environ.h" #include "opal/util/output.h" +#include "opal/util/basename.h" #include "opal/event/event.h" #include "opal/mca/crs/crs.h" #include "opal/mca/crs/base/base.h" @@ -74,6 +75,9 @@ static int orte_cr_coord_post_ckpt(void); static int orte_cr_coord_post_restart(void); static int orte_cr_coord_post_continue(void); +bool orte_cr_continue_like_restart = false; +bool orte_cr_flush_restart_files = true; + /************* * Local vars *************/ @@ -129,6 +133,10 @@ int orte_cr_init(void) /* Register the ORTE interlevel coordination callback */ opal_cr_reg_coord_callback(orte_cr_coord, &prev_coord_callback); + + /* Typically this is not needed. Individual BTLs will set this as needed */ + orte_cr_continue_like_restart = false; + orte_cr_flush_restart_files = true; cleanup: @@ -249,8 +257,16 @@ static int orte_cr_coord_pre_ckpt(void) { } } - cleanup: + /* + * Record the job session directory + * This way we will recreate it on restart so that any components that + * have old references to it (like btl/sm) can reference their files + * (to close the fd's to them) on restart. We will remove it before we + * create the new session directory. 
+ */ + orte_sstore.set_attr(orte_sstore_handle_current, SSTORE_METADATA_LOCAL_MKDIR, orte_process_info.job_session_dir); + cleanup: return exit_status; } @@ -293,10 +309,22 @@ static int orte_cr_coord_post_ckpt(void) { static int orte_cr_coord_post_restart(void) { int ret, exit_status = ORTE_SUCCESS; orte_proc_type_t prev_type = ORTE_PROC_TYPE_NONE; + char * tmp_dir = NULL; opal_output_verbose(10, orte_cr_output, "orte_cr: coord_post_restart: orte_cr_coord_post_restart()"); + /* + * Add the previous session directory for cleanup + */ + opal_crs_base_cleanup_append(orte_process_info.job_session_dir, true); + tmp_dir = opal_dirname(orte_process_info.job_session_dir); + if( NULL != tmp_dir ) { + opal_crs_base_cleanup_append(tmp_dir, true); + free(tmp_dir); + tmp_dir = NULL; + } + /* * Refresh System information */ diff --git a/orte/runtime/orte_cr.h b/orte/runtime/orte_cr.h index 3fb5e714a3..2d1b6760f4 100644 --- a/orte/runtime/orte_cr.h +++ b/orte/runtime/orte_cr.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. * Copyright (c) 2004-2005 The University of Tennessee and The University @@ -50,6 +50,13 @@ BEGIN_C_DECLS ORTE_DECLSPEC int orte_cr_entry_point_init(void); ORTE_DECLSPEC int orte_cr_entry_point_finalize(void); + /* + * Set if one of the BTLs that was shut down requires a full, clean rebuild of the + * point-to-point stack on 'continue' as well as 'restart'.
+ */ + ORTE_DECLSPEC extern bool orte_cr_continue_like_restart; + ORTE_DECLSPEC extern bool orte_cr_flush_restart_files; + END_C_DECLS #endif /* ORTE_CR_H */ diff --git a/orte/runtime/orte_globals.c b/orte/runtime/orte_globals.c index 29596f9d82..d0f299bbea 100644 --- a/orte/runtime/orte_globals.c +++ b/orte/runtime/orte_globals.c @@ -531,6 +531,9 @@ static void orte_app_context_construct(orte_app_context_t* app_context) app_context->preload_files_dest_dir = NULL; app_context->preload_files_src_dir = NULL; app_context->used_on_node = false; +#if OPAL_ENABLE_FT_CR == 1 + app_context->sstore_load = NULL; +#endif app_context->max_local_restarts = -1; app_context->max_global_restarts = -1; app_context->constrain = true; @@ -605,6 +608,13 @@ static void orte_app_context_destructor(orte_app_context_t* app_context) free(app_context->preload_files_src_dir); app_context->preload_files_src_dir = NULL; } + +#if OPAL_ENABLE_FT_CR == 1 + if( NULL != app_context->sstore_load ) { + free(app_context->sstore_load); + app_context->sstore_load = NULL; + } +#endif } OBJ_CLASS_INSTANCE(orte_app_context_t, diff --git a/orte/runtime/orte_globals.h b/orte/runtime/orte_globals.h index 847d32695a..2605d1cc44 100644 --- a/orte/runtime/orte_globals.h +++ b/orte/runtime/orte_globals.h @@ -213,6 +213,10 @@ typedef struct { char *preload_files_src_dir; /* is being used on the local node */ bool used_on_node; +#if OPAL_ENABLE_FT_CR == 1 + /** What files SStore should load before local launch, if any */ + char *sstore_load; +#endif /* max number of times a process can be restarted locally */ int32_t max_local_restarts; /* max number of times a process can be relocated to another node */ diff --git a/orte/tools/Makefile.am b/orte/tools/Makefile.am index 6aee96847b..5a3874f247 100644 --- a/orte/tools/Makefile.am +++ b/orte/tools/Makefile.am @@ -1,6 +1,6 @@ # -*- makefile -*- # -# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana +# Copyright (c) 2004-2010 The Trustees of Indiana
University and Indiana # University Research and Technology # Corporation. All rights reserved. # Copyright (c) 2004-2005 The University of Tennessee and The University @@ -34,7 +34,8 @@ SUBDIRS += \ tools/orterun \ tools/wrappers \ tools/orte-top \ - tools/orte-info + tools/orte-info \ + tools/orte-migrate DIST_SUBDIRS += \ tools/orte-bootproxy \ @@ -47,5 +48,6 @@ DIST_SUBDIRS += \ tools/orterun \ tools/wrappers \ tools/orte-top \ - tools/orte-info + tools/orte-info \ + tools/orte-migrate diff --git a/orte/tools/orte-checkpoint/help-orte-checkpoint.txt b/orte/tools/orte-checkpoint/help-orte-checkpoint.txt index 07fca2d058..ab347ddc92 100644 --- a/orte/tools/orte-checkpoint/help-orte-checkpoint.txt +++ b/orte/tools/orte-checkpoint/help-orte-checkpoint.txt @@ -100,3 +100,12 @@ argument to mpirun: The following feature was requested, but is not currently implemented. %s If you require this feature contact the Open MPI development group. + +[pid_not_found] +Error: The process with PID %d is not checkpointable. + This could be due to one of the following: + - An application with this PID doesn't currently exist + - The application with this PID isn't an Open MPI application. + +[hnp_not_found] +Error: The jobid specified by the '--hnp-jobid' option does not exist. 
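The `sstore_load` field added to `orte_app_context_t` is serialized in `orte_dt_pack_app_context()` as an `OPAL_INT8` presence flag followed, only when the flag is set, by the string itself; `orte_dt_unpack_app_context()` reads the flag first and sets the field to `NULL` when it is clear. The standalone sketch below illustrates that optional-field pattern under stated assumptions: `toy_buffer_t`, `pack_optional_string()` and `unpack_optional_string()` are hypothetical stand-ins for `opal_buffer_t` and `opal_dss_pack_buffer()`/`opal_dss_unpack_buffer()`, and the 32-bit length prefix used for the string is an assumed encoding, not the OPAL DSS wire format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed-size buffer standing in for opal_buffer_t. */
typedef struct {
    unsigned char data[256];
    size_t len; /* bytes written so far */
    size_t pos; /* read cursor */
} toy_buffer_t;

static void put_bytes(toy_buffer_t *buf, const void *src, size_t n)
{
    assert(buf->len + n <= sizeof(buf->data)); /* sketch: no resizing */
    memcpy(buf->data + buf->len, src, n);
    buf->len += n;
}

static int get_bytes(toy_buffer_t *buf, void *dst, size_t n)
{
    if (buf->pos + n > buf->len) {
        return -1; /* truncated input */
    }
    memcpy(dst, buf->data + buf->pos, n);
    buf->pos += n;
    return 0;
}

/* Write a one-byte presence flag; if set, follow it with the string
 * (assumed encoding: 32-bit length, then the bytes without '\0'). */
static void pack_optional_string(toy_buffer_t *buf, const char *str)
{
    int8_t present = (NULL != str) ? 1 : 0;
    put_bytes(buf, &present, sizeof(present));
    if (present) {
        uint32_t n = (uint32_t)strlen(str);
        put_bytes(buf, &n, sizeof(n));
        put_bytes(buf, str, n);
    }
}

/* Read the flag; on success *out is a malloc'd copy or NULL. */
static int unpack_optional_string(toy_buffer_t *buf, char **out)
{
    int8_t present = 0;
    uint32_t n = 0;
    char *s = NULL;

    *out = NULL;
    if (0 != get_bytes(buf, &present, sizeof(present))) {
        return -1;
    }
    if (!present) {
        return 0; /* mirrors setting sstore_load = NULL on the receiver */
    }
    if (0 != get_bytes(buf, &n, sizeof(n))) {
        return -1;
    }
    s = malloc((size_t)n + 1);
    if (NULL == s) {
        return -1;
    }
    if (0 != get_bytes(buf, s, n)) {
        free(s);
        return -1;
    }
    s[n] = '\0';
    *out = s;
    return 0;
}
```

The receiver must read fields in exactly the order the sender wrote them; the flag is what lets a `NULL` string round-trip without reserving a sentinel value.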
diff --git a/orte/tools/orte-checkpoint/orte-checkpoint.c b/orte/tools/orte-checkpoint/orte-checkpoint.c index 3dd33b8a07..081848e235 100644 --- a/orte/tools/orte-checkpoint/orte-checkpoint.c +++ b/orte/tools/orte-checkpoint/orte-checkpoint.c @@ -77,6 +77,8 @@ #include "opal/dss/dss.h" #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" #include MCA_timer_IMPLEMENTATION_HEADER @@ -113,6 +115,9 @@ static int global_sequence_num = 0; * Global Vars for Command line Arguments *****************************************/ static bool listener_started = false; +static bool is_checkpoint_finished = false; +static bool is_checkpoint_established = false; +static bool is_checkpoint_recovered = false; static double timer_start = 0; static double timer_last = 0; @@ -132,6 +137,11 @@ typedef struct { int output; int ckpt_status; bool list_only; /* List available checkpoints only */ +#if OPAL_ENABLE_CRDEBUG == 1 + bool enable_crdebug; /* Enable C/R Debugging */ + bool attach_debugger; + bool detach_debugger; +#endif } orte_checkpoint_globals_t; orte_checkpoint_globals_t orte_checkpoint_globals; @@ -199,6 +209,26 @@ opal_cmd_line_init_t cmd_line_opts[] = { &orte_checkpoint_globals.list_only, OPAL_CMD_LINE_TYPE_BOOL, "Display a list of checkpoint files available on this machine" }, +#if OPAL_ENABLE_CRDEBUG == 1 + { NULL, NULL, NULL, + '\0', "crdebug", "crdebug", + 0, + &orte_checkpoint_globals.enable_crdebug, OPAL_CMD_LINE_TYPE_BOOL, + "Enable C/R Enhanced Debugging" }, + + { NULL, NULL, NULL, + '\0', "attach", "attach", + 0, + &(orte_checkpoint_globals.attach_debugger), OPAL_CMD_LINE_TYPE_BOOL, + "Wait for the debugger to attach directly after taking the checkpoint." }, + + { NULL, NULL, NULL, + '\0', "detach", "detach", + 0, + &(orte_checkpoint_globals.detach_debugger), OPAL_CMD_LINE_TYPE_BOOL, + "Do not wait for the debugger to reattach after taking the checkpoint." 
}, +#endif + /* End of list */ { NULL, NULL, NULL, '\0', NULL, NULL, 0, NULL, OPAL_CMD_LINE_TYPE_NULL, @@ -244,6 +274,10 @@ main(int argc, char *argv[]) /******************************* * Checkpoint the requested PID *******************************/ + is_checkpoint_finished = false; + is_checkpoint_recovered = false; + is_checkpoint_established = false; + if( orte_checkpoint_globals.verbose ) { opal_output_verbose(10, orte_checkpoint_globals.output, "orte_checkpoint: Checkpointing..."); @@ -282,31 +316,18 @@ main(int argc, char *argv[]) /* * Wait for the checkpoint to complete */ - if(!orte_checkpoint_globals.nowait) { - while( ORTE_SNAPC_CKPT_STATE_FINISHED != orte_checkpoint_globals.ckpt_status && - ORTE_SNAPC_CKPT_STATE_STOPPED != orte_checkpoint_globals.ckpt_status && - ORTE_SNAPC_CKPT_STATE_NO_CKPT != orte_checkpoint_globals.ckpt_status && - ORTE_SNAPC_CKPT_STATE_ERROR != orte_checkpoint_globals.ckpt_status ) { + if(!orte_checkpoint_globals.nowait) { + while( !is_checkpoint_finished ) { opal_progress(); } } - if( ORTE_SNAPC_CKPT_STATE_NO_CKPT == orte_checkpoint_globals.ckpt_status ) { + if( ORTE_SNAPC_CKPT_STATE_NO_CKPT == orte_checkpoint_globals.ckpt_status || + ORTE_SNAPC_CKPT_STATE_ERROR == orte_checkpoint_globals.ckpt_status ) { exit_status = ORTE_ERROR; goto cleanup; } - if( ORTE_SNAPC_CKPT_STATE_ERROR == orte_checkpoint_globals.ckpt_status ) { - opal_show_help("help-orte-checkpoint.txt", "ckpt_failure", true, - orte_checkpoint_globals.pid, ORTE_ERROR); - exit_status = ORTE_ERROR; - goto cleanup; - } - - if( orte_checkpoint_globals.status ) { - pretty_print_status(); - } - if(!orte_checkpoint_globals.nowait) { pretty_print_reference(); } @@ -341,10 +362,17 @@ static int parse_args(int argc, char *argv[]) { orte_checkpoint_globals.output = -1; orte_checkpoint_globals.ckpt_status = ORTE_SNAPC_CKPT_STATE_NONE; orte_checkpoint_globals.list_only = false; +#if OPAL_ENABLE_CRDEBUG == 1 + orte_checkpoint_globals.enable_crdebug = false; +#endif 
orte_checkpoint_globals.options = OBJ_NEW(opal_crs_base_ckpt_options_t); orte_checkpoint_globals.term = false; orte_checkpoint_globals.stop = false; +#if OPAL_ENABLE_CRDEBUG == 1 + orte_checkpoint_globals.attach_debugger = false; + orte_checkpoint_globals.detach_debugger = false; +#endif /* Parse the command line options */ opal_cmd_line_create(&cmd_line, cmd_line_opts); @@ -412,6 +440,10 @@ static int parse_args(int argc, char *argv[]) { orte_checkpoint_globals.options->term = orte_checkpoint_globals.term; orte_checkpoint_globals.options->stop = orte_checkpoint_globals.stop; +#if OPAL_ENABLE_CRDEBUG == 1 + orte_checkpoint_globals.options->attach_debugger = orte_checkpoint_globals.attach_debugger; + orte_checkpoint_globals.options->detach_debugger = orte_checkpoint_globals.detach_debugger; +#endif if(orte_checkpoint_globals.verbose_level < 0 ) { orte_checkpoint_globals.verbose_level = 0; @@ -703,9 +735,10 @@ static void process_ckpt_update_cmd(orte_process_name_t* sender, } orte_checkpoint_globals.ckpt_status = ckpt_status; - if( ORTE_SNAPC_CKPT_STATE_FINISHED == orte_checkpoint_globals.ckpt_status || - ORTE_SNAPC_CKPT_STATE_STOPPED == orte_checkpoint_globals.ckpt_status || - ORTE_SNAPC_CKPT_STATE_ERROR == orte_checkpoint_globals.ckpt_status ) { + if( ORTE_SNAPC_CKPT_STATE_RECOVERED == orte_checkpoint_globals.ckpt_status || + ORTE_SNAPC_CKPT_STATE_ESTABLISHED == orte_checkpoint_globals.ckpt_status || + ORTE_SNAPC_CKPT_STATE_STOPPED == orte_checkpoint_globals.ckpt_status || + ORTE_SNAPC_CKPT_STATE_ERROR == orte_checkpoint_globals.ckpt_status ) { count = 1; if ( ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &global_snapshot_handle, &count, OPAL_STRING)) ) { ORTE_ERROR_LOG(ret); @@ -727,17 +760,40 @@ static void process_ckpt_update_cmd(orte_process_name_t* sender, opal_show_help("help-orte-checkpoint.txt", "non-ckptable", true, orte_checkpoint_globals.pid); + is_checkpoint_finished = true; exit_status = ORTE_ERROR; goto cleanup; } - /* - * If we are to display the 
status progression - */ + + if( ORTE_SNAPC_CKPT_STATE_ERROR == orte_checkpoint_globals.ckpt_status) { + opal_show_help("help-orte-checkpoint.txt", "ckpt_failure", true, + orte_checkpoint_globals.pid, ORTE_ERROR); + is_checkpoint_finished = true; + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* Status progression */ if( orte_checkpoint_globals.status ) { - if(ORTE_SNAPC_CKPT_STATE_FINISHED != orte_checkpoint_globals.ckpt_status && - ORTE_SNAPC_CKPT_STATE_STOPPED != orte_checkpoint_globals.ckpt_status) { - pretty_print_status(); - } + pretty_print_status(); + } + + if( ORTE_SNAPC_CKPT_STATE_STOPPED == orte_checkpoint_globals.ckpt_status) { + is_checkpoint_finished = true; + goto cleanup; + } + + /* Normal termination check */ + if( (ORTE_SNAPC_CKPT_STATE_RECOVERED == orte_checkpoint_globals.ckpt_status && is_checkpoint_established) || + (ORTE_SNAPC_CKPT_STATE_ESTABLISHED == orte_checkpoint_globals.ckpt_status && is_checkpoint_recovered) ){ + is_checkpoint_finished = true; + goto cleanup; + } + else if( ORTE_SNAPC_CKPT_STATE_RECOVERED == orte_checkpoint_globals.ckpt_status ) { + is_checkpoint_recovered = true; + } + else if(ORTE_SNAPC_CKPT_STATE_ESTABLISHED == orte_checkpoint_globals.ckpt_status ) { + is_checkpoint_established = true; } cleanup: @@ -862,7 +918,16 @@ static int pretty_print_status(void) { return ORTE_SUCCESS; } -static int pretty_print_reference(void) { +static int pretty_print_reference(void) +{ +#if OPAL_ENABLE_CRDEBUG == 1 + if( orte_checkpoint_globals.enable_crdebug ) { + printf("Checkpoint handle: -s %3d %s\n", + global_sequence_num, + global_snapshot_handle); + return ORTE_SUCCESS; + } +#endif printf("Snapshot Ref.: %3d %s\n", global_sequence_num, @@ -873,61 +938,69 @@ static int pretty_print_reference(void) { static int list_all_snapshots(void) { int ret, exit_status = ORTE_SUCCESS; - char **snapshot_refs = NULL; - int i, num_snapshot_refs = 0; - int *snapshot_ref_seqs = NULL; - int s, num_snapshot_ref_seqs = 0; + opal_list_t 
*all_snapshots = NULL; + opal_list_item_t* item = NULL; + orte_sstore_base_global_snapshot_info_t *global_snapshot = NULL; + int s; - /* Get all of the snapshot references */ - if( ORTE_SUCCESS != (ret = orte_snapc_base_get_all_snapshot_refs(NULL, &num_snapshot_refs, &snapshot_refs) ) ) { + all_snapshots = OBJ_NEW(opal_list_t); + + if( ORTE_SUCCESS != (ret = orte_sstore_base_get_all_snapshots(all_snapshots, NULL)) ) { opal_output(0, "Error: Unable to list the checkpoints in the directory <%s>\n", - orte_snapc_base_global_snapshot_dir); + orte_sstore_base_global_snapshot_dir); ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; } - /* For each snapshot reference, get a list of the valid seq numbers */ - for(i = 0; i < num_snapshot_refs; ++i) { - if( ORTE_SUCCESS != (ret = orte_snapc_base_get_all_snapshot_ref_seqs(NULL, snapshot_refs[i], - &num_snapshot_ref_seqs, - &snapshot_ref_seqs) ) ) { - opal_output(0, "Error: Unable to list the sequence numbers for the checkpoint <%s> in directory <%s>\n", - snapshot_refs[i], - orte_snapc_base_global_snapshot_dir); + /* + * For each reference + */ + for(item = opal_list_get_first(all_snapshots); + item != opal_list_get_end(all_snapshots); + item = opal_list_get_next(item) ) { + global_snapshot = (orte_sstore_base_global_snapshot_info_t*)item; + + /* + * Get a list of valid sequence numbers + */ + if( ORTE_SUCCESS != (ret = orte_sstore_base_find_all_seq_nums(global_snapshot, + &(global_snapshot->num_seqs), + &(global_snapshot->all_seqs)))) { ORTE_ERROR_LOG(ret); exit_status = ret; goto cleanup; } - /* Pretty print the result */ - printf("Snapshot Ref.: %s\t[", snapshot_refs[i]); - if( 0 >= num_snapshot_ref_seqs ) { - printf("No Valid Checkpoints"); - } - for(s = 0; s < num_snapshot_ref_seqs; ++s) { - if( s != 0 ) { - printf(","); + s = 0; /* Silence a compiler warning */ +#if OPAL_ENABLE_CRDEBUG == 1 + /* Pretty print the result - C/R Debug version */ + if( orte_checkpoint_globals.enable_crdebug ) { + for(s = 0; s < 
global_snapshot->num_seqs; ++s) { + printf("-s %s %s\n", global_snapshot->all_seqs[s], global_snapshot->reference); } - printf("%d", snapshot_ref_seqs[s]); } - printf("]\n"); - - if( NULL != snapshot_ref_seqs ) { - free(snapshot_ref_seqs); - snapshot_ref_seqs = NULL; + else +#endif + { + /* Pretty print the result */ + printf("Snapshot Ref.: %s\t[", + global_snapshot->reference); + if( 0 >= global_snapshot->num_seqs ) { + printf("No Valid Checkpoints"); + } else { + printf("%s", + opal_argv_join(global_snapshot->all_seqs, ',')); + } + printf("]\n"); } } cleanup: - if( NULL != snapshot_ref_seqs ) { - free(snapshot_ref_seqs); - snapshot_ref_seqs = NULL; - } - if( NULL != snapshot_refs ) { - free(snapshot_refs); - snapshot_refs = NULL; + while (NULL != (item = opal_list_remove_first(all_snapshots))) { + OBJ_RELEASE(item); } + OBJ_RELEASE(all_snapshots); return exit_status; } diff --git a/orte/tools/orte-migrate/CMakeLists.txt b/orte/tools/orte-migrate/CMakeLists.txt new file mode 100644 index 0000000000..6d90e63fd1 --- /dev/null +++ b/orte/tools/orte-migrate/CMakeLists.txt @@ -0,0 +1,37 @@ +# Copyright (c) 2007-2009 High Performance Computing Center Stuttgart, +# University of Stuttgart. All rights reserved. +# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. 
+# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +FILE(GLOB_RECURSE ORTE_MIGRATE_SOURCE_FILES "*.h" "*.c" "*.cc" "*.cpp") + +ADD_EXECUTABLE (orte-migrate ${ORTE_MIGRATE_SOURCE_FILES}) + +IF(BUILD_SHARED_LIBS) + SET_TARGET_PROPERTIES(orte-migrate PROPERTIES + COMPILE_FLAGS "-D_USRDLL -DOPAL_IMPORTS -DOMPI_IMPORTS -DORTE_IMPORTS /TP") +ENDIF(BUILD_SHARED_LIBS) + +TARGET_LINK_LIBRARIES (orte-migrate Ws2_32.lib shlwapi.lib) + +ADD_DEPENDENCIES (orte-migrate libopen-pal libopen-rte) + +ADD_CUSTOM_COMMAND (TARGET orte-migrate + POST_BUILD + COMMAND ${CMAKE_COMMAND} -E copy + ${OpenMPI_BINARY_DIR}/${CMAKE_CFG_INTDIR}/orte-migrate.exe + ${PROJECT_BINARY_DIR}/ompi-migrate.exe + COMMENT "Copying renamed executables...") + +INSTALL(TARGETS orte-migrate + DESTINATION bin) +INSTALL(FILES help-orte-migrate.txt DESTINATION share/openmpi) +INSTALL(FILES ${PROJECT_BINARY_DIR}/ompi-migrate.exe + DESTINATION bin) diff --git a/orte/tools/orte-migrate/Makefile.am b/orte/tools/orte-migrate/Makefile.am new file mode 100644 index 0000000000..38aabae10c --- /dev/null +++ b/orte/tools/orte-migrate/Makefile.am @@ -0,0 +1,42 @@ +# +# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. 
+# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# + +include $(top_srcdir)/Makefile.man-page-rules + +man_pages = orte-migrate.1 +EXTRA_DIST = orte-migrate.1in CMakeLists.txt + +if !ORTE_DISABLE_FULL_SUPPORT +if WANT_FT +if OMPI_INSTALL_BINARIES + +bin_PROGRAMS = orte-migrate + +nodist_man_MANS = $(man_pages) + +# Ensure that the man pages are rebuilt if the opal_config.h file +# changes; a "good enough" way to know if configure was run again (and +# therefore the release date or version may have changed) +$(nodist_man_MANS): $(top_builddir)/opal/include/opal_config.h + +dist_pkgdata_DATA = help-orte-migrate.txt + +endif # OMPI_INSTALL_BINARIES + +orte_migrate_SOURCES = orte-migrate.c +orte_migrate_LDADD = $(top_builddir)/orte/libopen-rte.la + +endif # WANT_FT +endif # !ORTE_DISABLE_FULL_SUPPORT + +distclean-local: + rm -f $(man_pages) diff --git a/orte/tools/orte-migrate/help-orte-migrate.txt b/orte/tools/orte-migrate/help-orte-migrate.txt new file mode 100644 index 0000000000..7944c67639 --- /dev/null +++ b/orte/tools/orte-migrate/help-orte-migrate.txt @@ -0,0 +1,51 @@ +# -*- text -*- +# +# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana +# University Research and Technology +# Corporation. All rights reserved. +# +# $COPYRIGHT$ +# +# Additional copyrights may follow +# +# $HEADER$ +# +# This is the US/English help file for Open MPI migrate tool +# +[usage] +ompi-migrate PID_OF_MPIRUN + Open MPI Process Migration Tool + +%s + +[invalid_pid] +Error: The PID (%d) is invalid because either you have not provided a PID + or provided an invalid PID. + Please see --help for usage. + +[no_universe] +Error: Unable to find the contact information for PID %d. + This could be due to one of the following: + - The PID is not that of an active MPIRUN. + - The application with this PID isn't migratable + - The application with this PID isn't an Open MPI application. 
+ ompi-migrate attempted to find the session directory: + %s + +[unable_to_connect] +Error: Unable to connect to the Head Node Process to initiate the + migration of the application. + This could be due to one of the following: + - The PID is not that of an active MPIRUN. + - The application with this PID isn't migratable + - The application with this PID isn't an Open MPI application. + +[not_impl] +The following feature was requested, but is not currently implemented. + %s +If you require this feature contact the Open MPI development group. + +[err-inprogress] +Error: The Job identified by PID (%d) is currently migrating other processes. + Only one migration request can be processed at a time. Please try again + later. diff --git a/orte/tools/orte-migrate/orte-migrate.1in b/orte/tools/orte-migrate/orte-migrate.1in new file mode 100644 index 0000000000..9248477d3e --- /dev/null +++ b/orte/tools/orte-migrate/orte-migrate.1in @@ -0,0 +1,81 @@ +.\" +.\" Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana +.\" University Research and Technology +.\" Corporation. All rights reserved. +.\" +.\" Man page for OMPI's ompi-migrate command +.\" +.\" .TH name section center-footer left-footer center-header +.TH OMPI-MIGRATE 1 "#OMPI_DATE#" "#PACKAGE_VERSION#" "#PACKAGE_NAME#" +.\" ************************** +.\" Name Section +.\" ************************** +.SH NAME +. +ompi-migrate, orte-migrate \- Migrate processes among resources in Open MPI. +. +.PP +. +\fBNOTE:\fP \fIompi-migrate\fP and \fIorte-migrate\fP are exact +synonyms for each other. Using either name will result in exactly +identical behavior. +. +.\" ************************** +.\" Synopsis Section +.\" ************************** +.SH SYNOPSIS +. +.B ompi-migrate +.R [ options ] +.B +. +.\" ************************** +.\" Options Section +.\" ************************** +.SH Options +.
+\fIorte-migrate\fR will attempt to notify a running parallel job (identified +by \fImpirun\fP) that a migration has been requested. +. +.TP 10 +.B +Process ID of the \fImpirun\fP process. +. +. +.TP +.B -h | --help +Display help for this command. +. +. +.TP +.B -v | --verbose +Enable verbose output for debugging. +. +. +.TP +.B -gmca | --gmca \fR \fP +Pass global MCA parameters that are applicable to all contexts. \fI\fP is +the parameter name; \fI\fP is the parameter value. +. +. +.TP +.B -mca | --mca +Send arguments to various MCA modules. +. +. +.\" ************************** +.\" Description Section +.\" ************************** +.SH DESCRIPTION +. +.PP +\fIorte-migrate\fR can be invoked multiple, non-overlapping times. +. +. +.\" ************************** +.\" See Also Section +.\" ************************** +. +.SH SEE ALSO + orte-ps(1), orte-clean(1), ompi-restart(1), ompi-checkpoint(1), opal-checkpoint(1), opal-restart(1), opal_crs(7) +. diff --git a/orte/tools/orte-migrate/orte-migrate.c b/orte/tools/orte-migrate/orte-migrate.c new file mode 100644 index 0000000000..793fe0e3bb --- /dev/null +++ b/orte/tools/orte-migrate/orte-migrate.c @@ -0,0 +1,768 @@ +/* + * Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana + * University Research and Technology + * Corporation. All rights reserved.
+ * + * $COPYRIGHT$ + * + * Additional copyrights may follow + * + * $HEADER$ + */ + +/** + * @file + * ORTE Process Migration Tool for migrating processes in a multiprocess job + * + */ + +#include "orte_config.h" +#include "orte/constants.h" + +#include +#include +#ifdef HAVE_STDLIB_H +#include +#endif /* HAVE_STDLIB_H */ +#ifdef HAVE_UNISTD_H +#include +#endif /* HAVE_UNISTD_H */ +#ifdef HAVE_FCNTL_H +#include +#endif /* HAVE_FCNTL_H */ +#ifdef HAVE_SYS_TYPES_H +#include +#endif /* HAVE_SYS_TYPES_H */ +#ifdef HAVE_SYS_STAT_H +#include /* for mkfifo */ +#endif /* HAVE_SYS_STAT_H */ +#ifdef HAVE_SYS_WAIT_H +#include +#endif /* HAVE_SYS_WAIT_H */ +#ifdef HAVE_STRING_H +#include +#endif /* HAVE_STRING_H */ + + +#include "opal/util/cmd_line.h" +#include "opal/util/output.h" +#include "opal/util/argv.h" +#include "opal/util/opal_environ.h" +#include "opal/mca/base/base.h" +#include "opal/mca/base/mca_base_param.h" +#include "opal/mca/crs/crs.h" +#include "opal/mca/crs/base/base.h" +#include "opal/runtime/opal.h" +#include "opal/runtime/opal_cr.h" + +#include "orte/runtime/runtime.h" +#include "orte/runtime/orte_cr.h" +#include "orte/util/hnp_contact.h" +#include "orte/runtime/orte_globals.h" +#include "orte/util/name_fns.h" +#include "opal/util/show_help.h" +#include "orte/util/proc_info.h" +#include "orte/mca/rml/rml.h" +#include "orte/mca/rml/rml_types.h" +#include "orte/mca/errmgr/errmgr.h" +#include "opal/dss/dss.h" +#include "orte/mca/snapc/snapc.h" +#include "orte/mca/snapc/base/base.h" + +#include "orte/mca/errmgr/errmgr.h" +#include "orte/mca/errmgr/base/base.h" + +#include MCA_timer_IMPLEMENTATION_HEADER + +/****************** + * Local Functions + ******************/ +static int tool_init(int argc, char *argv[]); /* Initialization routine */ +static int tool_finalize(void); /* Finalization routine */ +static int parse_args(int argc, char *argv[]); +static int find_hnp(void); + +static int start_listener(void); +static int stop_listener(void); +static void
hnp_receiver(int status, + orte_process_name_t* sender, + opal_buffer_t* buffer, + orte_rml_tag_t tag, + void* cbdata); + +static void process_ckpt_update_cmd(orte_process_name_t* sender, + opal_buffer_t* buffer); + +static int notify_hnp(void); +static int pretty_print_status(void); +static int pretty_print_migration(void); + +static orte_hnp_contact_t *orterun_hnp = NULL; +static int orte_migrate_ckpt_status = ORTE_ERRMGR_MIGRATE_STATE_NONE; + +/***************************************** + * Global Vars for Command line Arguments + *****************************************/ +static bool listener_started = false; + +static double timer_start = 0; +static double timer_last = 0; +static double get_time(void); + +typedef struct { + bool help; + int pid; + bool verbose; + int verbose_level; + bool status; + int output; + char *off_nodes; + char *off_procs; + char *onto_nodes; +} orte_migrate_globals_t; + +orte_migrate_globals_t orte_migrate_globals; + +opal_cmd_line_init_t cmd_line_opts[] = { + { NULL, NULL, NULL, + 'h', NULL, "help", + 0, + &orte_migrate_globals.help, OPAL_CMD_LINE_TYPE_BOOL, + "This help message" }, + + { NULL, NULL, NULL, + 'v', NULL, "verbose", + 0, + &orte_migrate_globals.verbose, OPAL_CMD_LINE_TYPE_BOOL, + "Be Verbose" }, + + { NULL, NULL, NULL, + 'V', NULL, NULL, + 1, + &orte_migrate_globals.verbose_level, OPAL_CMD_LINE_TYPE_INT, + "Set the verbosity level (For additional debugging information)" }, + + { "hnp-pid", NULL, NULL, + '\0', NULL, "hnp-pid", + 1, + &orte_migrate_globals.pid, OPAL_CMD_LINE_TYPE_INT, + "This should be the pid of the mpirun whose applications you wish " + "to migrate." 
}, + + { NULL, NULL, NULL, + 'x', NULL, "off", + 1, + &orte_migrate_globals.off_nodes, OPAL_CMD_LINE_TYPE_STRING, + "List of nodes to migrate off of (comma separated)" }, + + { NULL, NULL, NULL, + 'r', NULL, "ranks", + 1, + &orte_migrate_globals.off_procs, OPAL_CMD_LINE_TYPE_STRING, + "List of MPI_COMM_WORLD ranks to migrate (comma separated)" }, + + { NULL, NULL, NULL, + 't', NULL, "onto", + 1, + &orte_migrate_globals.onto_nodes, OPAL_CMD_LINE_TYPE_STRING, + "List of nodes to migrate onto (comma separated)" }, + + /* End of list */ + { NULL, NULL, NULL, '\0', NULL, NULL, 0, + NULL, OPAL_CMD_LINE_TYPE_NULL, + NULL } +}; + +int +main(int argc, char *argv[]) +{ + int ret, exit_status = ORTE_SUCCESS; + + /*************** + * Initialize + ***************/ + if (ORTE_SUCCESS != (ret = tool_init(argc, argv))) { + exit_status = ret; + goto cleanup; + } + + /*************************** + * Find the HNP that we want to connect to, if it exists + ***************************/ + if( orte_migrate_globals.verbose ) { + opal_output_verbose(10, orte_migrate_globals.output, + "orte_migrate: Finding HNP..."); + } + if (ORTE_SUCCESS != (ret = find_hnp())) { + opal_show_help("help-orte-migrate.txt", "invalid_pid", + true, orte_migrate_globals.pid); + exit_status = ret; + goto cleanup; + } + + /******************************* + * Send migration information to HNP + *******************************/ + if( orte_migrate_globals.verbose ) { + opal_output_verbose(10, orte_migrate_globals.output, + "orte_migrate: Sending info to HNP..."); + } + if (ORTE_SUCCESS != (ret = notify_hnp())) { + opal_output(0, + "HNP with PID %d Not found!", + orte_migrate_globals.pid); + exit_status = ret; + goto cleanup; + } + + /******************************* + * Wait for migration to complete + *******************************/ + while( ORTE_ERRMGR_MIGRATE_STATE_FINISH != orte_migrate_ckpt_status && + ORTE_ERRMGR_MIGRATE_STATE_ERROR != orte_migrate_ckpt_status && + ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS != 
orte_migrate_ckpt_status) { + opal_progress(); + } + + if( orte_migrate_globals.status && + ORTE_ERRMGR_MIGRATE_STATE_FINISH == orte_migrate_ckpt_status ) { + pretty_print_status(); + } + + cleanup: + /*************** + * Cleanup + ***************/ + if (ORTE_SUCCESS != (ret = tool_finalize())) { + return ret; + } + + return exit_status; +} + +static int parse_args(int argc, char *argv[]) { + int i, ret, len, exit_status = ORTE_SUCCESS ; + opal_cmd_line_t cmd_line; + char **app_env = NULL, **global_env = NULL; + char * tmp_env_var = NULL; + + /* Init structure */ + memset(&orte_migrate_globals, 0, sizeof(orte_migrate_globals_t)); + orte_migrate_globals.help = false; + orte_migrate_globals.pid = -1; + orte_migrate_globals.verbose = false; + orte_migrate_globals.verbose_level = 0; + orte_migrate_globals.status = false; + orte_migrate_globals.output = -1; + orte_migrate_globals.off_nodes = NULL; + orte_migrate_globals.off_procs = NULL; + orte_migrate_globals.onto_nodes = NULL; + + /* Parse the command line options */ + opal_cmd_line_create(&cmd_line, cmd_line_opts); + mca_base_open(); + mca_base_cmd_line_setup(&cmd_line); + ret = opal_cmd_line_parse(&cmd_line, true, argc, argv); + + /** + * Put all of the MCA arguments in the environment + */ + mca_base_cmd_line_process_args(&cmd_line, &app_env, &global_env); + + len = opal_argv_count(app_env); + for(i = 0; i < len; ++i) { + putenv(app_env[i]); + } + + len = opal_argv_count(global_env); + for(i = 0; i < len; ++i) { + putenv(global_env[i]); + } + + tmp_env_var = mca_base_param_env_var("opal_cr_is_tool"); + opal_setenv(tmp_env_var, + "1", + true, &environ); + free(tmp_env_var); + tmp_env_var = NULL; + + /** + * Now start parsing our specific arguments + */ + /* get the remaining bits */ + opal_cmd_line_get_tail(&cmd_line, &argc, &argv); + +#if OPAL_ENABLE_FT_CR == 0 + /* Warn and exit if not configured with Checkpoint/Restart */ + { + char *args = NULL; + args = opal_cmd_line_get_usage_msg(&cmd_line); +
opal_show_help("help-orte-migrate.txt", "usage-no-cr", + true, args); + free(args); + exit_status = ORTE_ERROR; + goto cleanup; + } +#endif + + if (OPAL_SUCCESS != ret || + orte_migrate_globals.help || + 0 >= argc || + (NULL == orte_migrate_globals.off_nodes && NULL == orte_migrate_globals.off_procs) ) { + char *args = NULL; + args = opal_cmd_line_get_usage_msg(&cmd_line); + opal_show_help("help-orte-migrate.txt", "usage", true, + args); + free(args); + exit_status = ORTE_ERROR; + goto cleanup; + } + + if(orte_migrate_globals.verbose_level < 0 ) { + orte_migrate_globals.verbose_level = 0; + } + + if(orte_migrate_globals.verbose_level > 0) { + orte_migrate_globals.verbose = true; + } + + /* + * If the user did not supply an hnp jobid, then they must + * supply the PID of MPIRUN + */ + if(0 >= argc ) { + exit_status = ORTE_SUCCESS; + goto cleanup; + } + + orte_migrate_globals.pid = atoi(argv[0]); + if ( 0 >= orte_migrate_globals.pid ) { + opal_show_help("help-orte-migrate.txt", "invalid_pid", true, + orte_migrate_globals.pid); + exit_status = ORTE_ERROR; + goto cleanup; + } + + if(orte_migrate_globals.verbose) { + orte_migrate_globals.status = true; + } + + if(orte_migrate_globals.verbose) { + pretty_print_migration(); + } + + cleanup: + return exit_status; +} + +/* + * This function attempts to find an HNP to connect to. 
+ */ +static int find_hnp(void) { + int ret, exit_status = ORTE_SUCCESS; + opal_list_t hnp_list; + opal_list_item_t *item; + orte_hnp_contact_t *hnpcandidate; + + /* get the list of local hnp's available to us and setup + * contact info for them into the RML + */ + OBJ_CONSTRUCT(&hnp_list, opal_list_t); + if (ORTE_SUCCESS != (ret = orte_list_local_hnps(&hnp_list, true) ) ) { + ORTE_ERROR_LOG(ret); + exit_status = ret; + goto cleanup; + } + + /* search the list for the desired hnp */ + while (NULL != (item = opal_list_remove_first(&hnp_list))) { + hnpcandidate = (orte_hnp_contact_t*)item; + if( hnpcandidate->pid == orte_migrate_globals.pid) { + /* this is the one we want; keep it */ + orterun_hnp = hnpcandidate; + goto cleanup; + } + OBJ_RELEASE(item); /* not a match: release it */ + } + +cleanup: + while (NULL != (item = opal_list_remove_first(&hnp_list))) { + OBJ_RELEASE(item); + } + OBJ_DESTRUCT(&hnp_list); + + if( NULL == orterun_hnp ) { + return ORTE_ERROR; + } else { + return exit_status; + } +} + +static int tool_init(int argc, char *argv[]) { + int exit_status = ORTE_SUCCESS, ret; + char * tmp_env_var = NULL; + + listener_started = false; + + /* + * Make sure to init util before parse_args + * to ensure installdirs is set up properly + * before calling mca_base_open(); + */ + if( ORTE_SUCCESS != (ret = opal_init_util(&argc, &argv)) ) { + return ret; + } + + /* + * Parse Command Line Arguments + */ + if (ORTE_SUCCESS != (ret = parse_args(argc, argv))) { + return ret; + } + + /* Disable the checkpoint notification routine for this + * tool, as we will never need to checkpoint this tool. + * Note: This must happen before opal_init().
+ */ + opal_cr_set_enabled(false); + + /* Select the none component, since we don't actually use a checkpointer */ + tmp_env_var = mca_base_param_env_var("crs"); + opal_setenv(tmp_env_var, + "none", + true, &environ); + free(tmp_env_var); + tmp_env_var = NULL; + + /*************************** + * We need all of OPAL and the TOOLS portion of ORTE - this + * sets us up so we can talk to any HNP over the wire + ***************************/ + if (ORTE_SUCCESS != (ret = orte_init(&argc, &argv, ORTE_PROC_TOOL))) { + exit_status = ret; + goto cleanup; + } + + /* + * Setup ORTE Output handle from the verbose argument + */ + if( orte_migrate_globals.verbose ) { + orte_migrate_globals.output = opal_output_open(NULL); + opal_output_set_verbosity(orte_migrate_globals.output, orte_migrate_globals.verbose_level); + } else { + orte_migrate_globals.output = 0; /* Default=STDERR */ + } + + /* + * Start the listener + */ + if( ORTE_SUCCESS != (ret = start_listener() ) ) { + exit_status = ret; + } + + cleanup: + return exit_status; +} + +static int tool_finalize(void) { + int exit_status = ORTE_SUCCESS, ret; + + /* + * Stop the listener + */ + if( ORTE_SUCCESS != (ret = stop_listener() ) ) { + exit_status = ret; + } + + if (ORTE_SUCCESS != (ret = orte_finalize())) { + exit_status = ret; + goto cleanup; + } + + cleanup: + return exit_status; +} + +static int start_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if (ORTE_SUCCESS != (ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_MIGRATE, + ORTE_RML_PERSISTENT, + hnp_receiver, + NULL))) { + exit_status = ret; + goto cleanup; + } + + listener_started = true; + + cleanup: + return exit_status; +} + +static int stop_listener(void) +{ + int ret, exit_status = ORTE_SUCCESS; + + if( !listener_started ) { + exit_status = ORTE_ERROR; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = orte_rml.recv_cancel(ORTE_NAME_WILDCARD, + ORTE_RML_TAG_MIGRATE))) { + exit_status = ret; + goto cleanup; + } + + listener_started =
false; + cleanup: + return exit_status; +} + +static void hnp_receiver(int status, + orte_process_name_t* sender, + opal_buffer_t* buffer, + orte_rml_tag_t tag, + void* cbdata) +{ + orte_errmgr_tool_cmd_flag_t command; + orte_std_cntr_t count; + int rc; + + opal_output_verbose(5, orte_migrate_globals.output, + "orte_migrate: hnp_receiver: Received a command message."); + + /* + * This is an inter-coordinator command (usually updating state info). + */ + count = 1; + if (ORTE_SUCCESS != (rc = opal_dss.unpack(buffer, &command, &count, ORTE_ERRMGR_MIGRATE_TOOL_CMD))) { + ORTE_ERROR_LOG(rc); + return; + } + + switch (command) { + case ORTE_ERRMGR_MIGRATE_TOOL_UPDATE_CMD: + opal_output_verbose(10, orte_migrate_globals.output, + "orte_migrate: hnp_receiver: Status Update."); + + process_ckpt_update_cmd(sender, buffer); + break; + + case ORTE_ERRMGR_MIGRATE_TOOL_INIT_CMD: + /* Do Nothing */ + break; + + default: + ORTE_ERROR_LOG(ORTE_ERR_VALUE_OUT_OF_BOUNDS); + } +} + +static void process_ckpt_update_cmd(orte_process_name_t* sender, + opal_buffer_t* buffer) +{ + int ret, exit_status = ORTE_SUCCESS; + orte_std_cntr_t count = 1; + int ckpt_status = ORTE_ERRMGR_MIGRATE_STATE_NONE; + + /* + * Receive the data: + * - ckpt_state + */ + count = 1; + if ( ORTE_SUCCESS != (ret = opal_dss.unpack(buffer, &ckpt_status, &count, OPAL_INT)) ) { + exit_status = ret; + goto cleanup; + } + orte_migrate_ckpt_status = ckpt_status; + + /* + * If the job is not able to be migrated, then return + */ + if( ORTE_SNAPC_CKPT_STATE_NO_CKPT == orte_migrate_ckpt_status) { + opal_show_help("help-orte-migrate.txt", "non-ckptable", + true, + orte_migrate_globals.pid); + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* + * If a migration is already in progress, then we must tell the user to + * try again later.
+ */ + if( ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS == orte_migrate_ckpt_status) { + opal_show_help("help-orte-migrate.txt", "err-inprogress", + true, + orte_migrate_globals.pid); + exit_status = ORTE_ERROR; + goto cleanup; + } + + /* + * If we are to display the status progression + */ + if( orte_migrate_globals.status ) { + if(ORTE_ERRMGR_MIGRATE_STATE_FINISH != orte_migrate_ckpt_status) { + pretty_print_status(); + } + } + + cleanup: + return; +} + +static int notify_hnp(void) +{ + int ret, exit_status = ORTE_SUCCESS; + opal_buffer_t *buffer = NULL; + orte_errmgr_tool_cmd_flag_t command = ORTE_ERRMGR_MIGRATE_TOOL_INIT_CMD; + + if (NULL == (buffer = OBJ_NEW(opal_buffer_t))) { + exit_status = ORTE_ERROR; + goto cleanup; + } + + opal_output_verbose(10, orte_migrate_globals.output, + "orte_migrate: notify_hnp: Contact Head Node Process PID %d\n", + orte_migrate_globals.pid); + + timer_start = get_time(); + + /*********************************** + * Notify HNP of migrate request + * Send: + * - Command + * - Off Nodes + * - Off Procs + * - Onto Nodes + ***********************************/ + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &command, 1, ORTE_ERRMGR_MIGRATE_TOOL_CMD)) ) { + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(orte_migrate_globals.off_procs), 1, OPAL_STRING)) ) { + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(orte_migrate_globals.off_nodes), 1, OPAL_STRING)) ) { + exit_status = ret; + goto cleanup; + } + + if (ORTE_SUCCESS != (ret = opal_dss.pack(buffer, &(orte_migrate_globals.onto_nodes), 1, OPAL_STRING)) ) { + exit_status = ret; + goto cleanup; + } + + if ( 0 > (ret = orte_rml.send_buffer(&(orterun_hnp->name), buffer, ORTE_RML_TAG_MIGRATE, 0)) ) { + exit_status = ret; + goto cleanup; + } + + cleanup: + if( NULL != buffer) { + OBJ_RELEASE(buffer); + buffer = NULL; + } + + if( ORTE_SUCCESS != exit_status ) { + opal_show_help("help-orte-migrate.txt", 
"unable_to_connect", true, + orte_migrate_globals.pid); + } + + return exit_status; +} + +/*************** + * Pretty Print + ***************/ +static double get_time(void) { + double wtime; + +#if OPAL_TIMER_USEC_NATIVE + wtime = (double)opal_timer_base_get_usec() / 1000000.0; +#else + struct timeval tv; + gettimeofday(&tv, NULL); + wtime = tv.tv_sec; + wtime += (double)tv.tv_usec / 1000000.0; +#endif + + return wtime; +} + +static int pretty_print_status(void) { + char * state_str = NULL; + double cur_time; + + cur_time = get_time(); + + if( timer_last == 0 ) { + timer_last = cur_time; + } + + orte_errmgr_base_migrate_state_str(&state_str, orte_migrate_ckpt_status); + + opal_output(0, + "[%6.2f / %6.2f] %*s - ...\n", + (cur_time - timer_last), (cur_time - timer_start), + 25, state_str); + + if( NULL != state_str) { + free(state_str); + } + + timer_last = cur_time; + + return ORTE_SUCCESS; +} + +static int pretty_print_migration(void) +{ + char **loc_off_nodes = NULL; + char **loc_off_procs = NULL; + char **loc_onto_nodes = NULL; + int loc_off_nodes_cnt = 0; + int loc_off_procs_cnt = 0; + int loc_onto_cnt = 0; + int i; + + if( NULL != orte_migrate_globals.off_nodes ) { + loc_off_nodes = opal_argv_split(orte_migrate_globals.off_nodes, ','); + loc_off_nodes_cnt = opal_argv_count(loc_off_nodes); + } + + if( NULL != orte_migrate_globals.off_procs ) { + loc_off_procs = opal_argv_split(orte_migrate_globals.off_procs, ','); + loc_off_procs_cnt = opal_argv_count(loc_off_procs); + } + + if( NULL != orte_migrate_globals.onto_nodes ) { + loc_onto_nodes = opal_argv_split(orte_migrate_globals.onto_nodes, ','); + loc_onto_cnt = opal_argv_count(loc_onto_nodes); + } + + printf("Migrate Nodes: (%d nodes)\n", loc_off_nodes_cnt); + for(i = 0; i < loc_off_nodes_cnt; ++i) { + printf("\t\"%s\"\n", loc_off_nodes[i]); + } + + printf("Migrate Ranks: (%d ranks)\n", loc_off_procs_cnt); + for(i = 0; i < loc_off_procs_cnt; ++i) { + printf("\t\"%s\"\n", loc_off_procs[i]); + } + + 
+ printf("Migrate Onto : (%d nodes)\n", loc_onto_cnt); + for(i = 0; i < loc_onto_cnt; ++i) { + printf("\t\"%s\"\n", loc_onto_nodes[i]); + } + + return ORTE_SUCCESS; +} + diff --git a/orte/tools/orte-restart/help-orte-restart.txt b/orte/tools/orte-restart/help-orte-restart.txt index 8bf25bede3..f14be54328 100644 --- a/orte/tools/orte-restart/help-orte-restart.txt +++ b/orte/tools/orte-restart/help-orte-restart.txt @@ -1,6 +1,6 @@ # -*- text -*- # -# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana +# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana # University Research and Technology # Corporation. All rights reserved. # Copyright (c) 2004-2005 The University of Tennessee and The University @@ -35,10 +35,15 @@ ompi-restart GLOBAL_SNAPSHOT_REF %s [invalid_filename] -Error: The filename (%s) is invalid because either you have not provided a filename - or provided an invalid filename. +Error: The filename provided (referenced below) could not be used for + restarting the job. This could be for a variety of reasons: + - The filename/handle is invalid, + - The snapshot directory no longer exists, or + - There are no stable checkpoint sequences in this global snapshot. Please see --help for usage. +Filename: %s + [restart_cmd_failure] Error: Unable to obtain the proper restart command to restart from the checkpoint file (%s). Returned %d. @@ -60,4 +65,7 @@ Error: The filename (%s) and sequence number (%d) could not be used. This may be caused by an invalid sequence number. Try using the '-i' option to determine a correct value. - +[amca_param_not_found] +Warning: Unable to find the AMCA parameter in the checkpoint metadata. + This is the option supplied to mpirun as '-am '. Restart will + assume this value to be '%s'.
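The reworked main() in the orte-restart.c diff that follows accepts a checkpoint handle that may carry a directory prefix. If the handle is an absolute path or contains a '/', it is split into a base directory and a bare reference. Otherwise the base directory is left NULL, so the SStore MCA parameter decides where to look. The sketch below shows that splitting rule in isolation. It assumes POSIX dirname()/basename() stand in for Open MPI's opal_dirname()/opal_basename(), whose edge-case behavior may differ. The name split_snapshot_ref is hypothetical and not part of the patch. Because every absolute path contains a '/', the patch's two prefixed branches reduce here to a single strchr() test.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <libgen.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): split a checkpoint handle
 * such as "/scratch/ckpt/ompi_global_snapshot_1234.ckpt" into a base
 * directory and a bare snapshot reference. If the handle contains no '/',
 * *basedir is set to NULL so the caller can fall back to an MCA parameter.
 * The caller owns (and must free) both outputs.
 */
static void split_snapshot_ref(const char *handle, char **basedir, char **ref)
{
    assert(NULL != handle);

    if (NULL == strchr(handle, '/')) {
        *basedir = NULL;               /* no prefix: use the default location */
        *ref = strdup(handle);
        return;
    }

    /* POSIX dirname()/basename() may modify their argument, so use copies */
    char *dir_copy = strdup(handle);
    char *base_copy = strdup(handle);
    *basedir = strdup(dirname(dir_copy));
    *ref = strdup(basename(base_copy));
    free(dir_copy);
    free(base_copy);
}
```

For example, "ckpts/ompi_global_snapshot_1234.ckpt" yields the base directory "ckpts" and the reference "ompi_global_snapshot_1234.ckpt". A bare "ompi_global_snapshot_1234.ckpt" yields a NULL base directory and the unchanged reference.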
diff --git a/orte/tools/orte-restart/orte-restart.c b/orte/tools/orte-restart/orte-restart.c index ff96f92a21..189ac106bb 100644 --- a/orte/tools/orte-restart/orte-restart.c +++ b/orte/tools/orte-restart/orte-restart.c @@ -55,6 +55,7 @@ #include "opal/util/argv.h" #include "opal/util/opal_environ.h" #include "opal/util/basename.h" +#include "opal/util/path.h" #include "opal/mca/base/base.h" #include "opal/mca/base/mca_base_param.h" #include "opal/mca/crs/crs.h" @@ -64,6 +65,8 @@ #include "orte/runtime/orte_cr.h" #include "orte/mca/snapc/snapc.h" #include "orte/mca/snapc/base/base.h" +#include "orte/mca/sstore/sstore.h" +#include "orte/mca/sstore/base/base.h" #include "orte/mca/filem/base/base.h" #include "opal/util/show_help.h" #include "orte/util/proc_info.h" @@ -74,10 +77,9 @@ static int initialize(int argc, char *argv[]); static int finalize(void); static int parse_args(int argc, char *argv[]); -static int check_file(orte_snapc_base_global_snapshot_t *snapshot); -static int create_appfile(orte_snapc_base_global_snapshot_t *snapshot); -static int spawn_children(orte_snapc_base_global_snapshot_t *snapshot, pid_t *child_pid); -static int snapshot_info(orte_snapc_base_global_snapshot_t *snapshot); +static int create_appfile(orte_sstore_base_global_snapshot_info_t *snapshot); +static int spawn_children(orte_sstore_base_global_snapshot_info_t *snapshot, pid_t *child_pid); +static int snapshot_info(orte_sstore_base_global_snapshot_info_t *snapshot); static int snapshot_sort_compare_fn(opal_list_item_t **a, opal_list_item_t **b); @@ -86,16 +88,20 @@ static int snapshot_sort_compare_fn(opal_list_item_t **a, *****************************************/ typedef struct { bool help; - char *filename; + char *snapshot_ref; char *appfile; bool verbose; bool forked; - bool preload; int seq_number; char *hostfile; int output; bool info_only; bool app_only; + bool showme; + char *mpirun_opts; +#if OPAL_ENABLE_CRDEBUG == 1 + bool enable_crdebug; +#endif } orte_restart_globals_t; 
orte_restart_globals_t orte_restart_globals; @@ -113,12 +119,6 @@ opal_cmd_line_init_t cmd_line_opts[] = { &orte_restart_globals.verbose, OPAL_CMD_LINE_TYPE_BOOL, "Be Verbose" }, - { NULL, NULL, NULL, - 'p', NULL, "preload", - 0, - &orte_restart_globals.preload, OPAL_CMD_LINE_TYPE_BOOL, - "Preload the checkpoint files before restarting (Default = Disabled)" }, - { NULL, NULL, NULL, '\0', NULL, "fork", 0, @@ -157,6 +157,26 @@ opal_cmd_line_init_t cmd_line_opts[] = { &orte_restart_globals.app_only, OPAL_CMD_LINE_TYPE_BOOL, "Only create the app context file, do not restart from it" }, + { NULL, NULL, NULL, + '\0', NULL, "showme", + 0, + &orte_restart_globals.showme, OPAL_CMD_LINE_TYPE_BOOL, + "Display the full command line that would have been exec'ed." }, + + { NULL, NULL, NULL, + '\0', "mpirun_opts", "mpirun_opts", + 1, + &orte_restart_globals.mpirun_opts, OPAL_CMD_LINE_TYPE_STRING, + "Command line options to pass directly to mpirun (be sure to quote long strings, and escape internal quotes)" }, + +#if OPAL_ENABLE_CRDEBUG == 1 + { NULL, NULL, NULL, + '\0', "crdebug", "crdebug", + 0, + &orte_restart_globals.enable_crdebug, OPAL_CMD_LINE_TYPE_BOOL, + "Enable C/R Enhanced Debugging" }, +#endif + /* End of list */ { NULL, NULL, NULL, '\0', NULL, NULL, @@ -170,7 +190,8 @@ main(int argc, char *argv[]) { int ret, exit_status = ORTE_SUCCESS; pid_t child_pid = 0; - orte_snapc_base_global_snapshot_t *snapshot = NULL; + orte_sstore_base_global_snapshot_info_t *snapshot = NULL; + char *basedir = NULL; char *tmp_str = NULL; /*************** @@ -181,22 +202,42 @@ main(int argc, char *argv[]) goto cleanup; } - snapshot = OBJ_NEW(orte_snapc_base_global_snapshot_t); - snapshot->reference_name = strdup(orte_restart_globals.filename); - orte_snapc_base_get_global_snapshot_directory(&tmp_str, snapshot->reference_name); - snapshot->local_location = opal_dirname(tmp_str); - free(tmp_str); - tmp_str = NULL; + snapshot = OBJ_NEW(orte_sstore_base_global_snapshot_info_t); - /* - * Check for 
existence of the file + if( opal_path_is_absolute(orte_restart_globals.snapshot_ref) ) { + basedir = opal_dirname(orte_restart_globals.snapshot_ref); + tmp_str = opal_basename(orte_restart_globals.snapshot_ref); + free(orte_restart_globals.snapshot_ref); + orte_restart_globals.snapshot_ref = strdup(tmp_str); + free(tmp_str); + tmp_str = NULL; + } else if( NULL != strchr(orte_restart_globals.snapshot_ref, '/') ) { + basedir = opal_dirname(orte_restart_globals.snapshot_ref); + tmp_str = opal_basename(orte_restart_globals.snapshot_ref); + free(orte_restart_globals.snapshot_ref); + orte_restart_globals.snapshot_ref = strdup(tmp_str); + free(tmp_str); + tmp_str = NULL; + } else { + basedir = NULL; /* Use MCA parameter */ + } + + /* + * Note: If the seq # passed is -1, then the largest seq # is selected, + * otherwise the seq # requested is selected if available + * 'basedir': Snapshot base location to look in. If NULL, the MCA parameter is used */ - if( ORTE_SUCCESS != (ret = check_file(snapshot)) ) { + if( ORTE_SUCCESS != (ret = orte_sstore.request_restart_handle(&(snapshot->ss_handle), + basedir, + orte_restart_globals.snapshot_ref, + orte_restart_globals.seq_number, + snapshot))) { opal_show_help("help-orte-restart.txt", "invalid_filename", true, - orte_restart_globals.filename); + orte_restart_globals.snapshot_ref); exit_status = ret; goto cleanup; } + orte_restart_globals.seq_number = snapshot->seq_num; if(orte_restart_globals.info_only ) { if (ORTE_SUCCESS != (ret = snapshot_info(snapshot))) { @@ -227,7 +268,7 @@ main(int argc, char *argv[]) if( orte_restart_globals.verbose ) { opal_output_verbose(10, orte_restart_globals.output, "Restarting from file (%s)", - orte_restart_globals.filename); + orte_restart_globals.snapshot_ref); if( orte_restart_globals.forked ) { opal_output_verbose(10, orte_restart_globals.output, @@ -240,7 +281,7 @@ main(int argc, char *argv[]) if( ORTE_SUCCESS != (ret = spawn_children(snapshot, &child_pid)) ) { opal_show_help("help-orte-restart.txt",
"restart_cmd_failure", true, - orte_restart_globals.filename, ret); + orte_restart_globals.snapshot_ref, ret); exit_status = ret; goto cleanup; } @@ -252,7 +293,15 @@ main(int argc, char *argv[]) * Cleanup ***************/ cleanup: - if(NULL != snapshot ) { + if( NULL != basedir ) { + free(basedir); + basedir = NULL; + } + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + if( NULL != snapshot ) { OBJ_RELEASE(snapshot); snapshot = NULL; } @@ -354,14 +403,18 @@ static int parse_args(int argc, char *argv[]) NULL, /* appfile */ false, /* verbose */ false, /* forked */ - false, /* preload */ -1, /* seq_number */ NULL, /* hostfile */ -1, /* output*/ false, /* info only */ - false };/* app only */ + false, /* app only */ + false, /* showme */ + NULL}; /* mpirun_opts */ orte_restart_globals = tmp; +#if OPAL_ENABLE_CRDEBUG == 1 + orte_restart_globals.enable_crdebug = false; +#endif /* Parse the command line options */ opal_cmd_line_create(&cmd_line, cmd_line_opts); @@ -430,11 +483,11 @@ static int parse_args(int argc, char *argv[]) return ORTE_ERROR; } - orte_restart_globals.filename = strdup(argv[0]); - if ( NULL == orte_restart_globals.filename || - 0 >= strlen(orte_restart_globals.filename) ) { + orte_restart_globals.snapshot_ref = strdup(argv[0]); + if ( NULL == orte_restart_globals.snapshot_ref || + 0 >= strlen(orte_restart_globals.snapshot_ref) ) { opal_show_help("help-orte-restart.txt", "invalid_filename", true, - orte_restart_globals.filename); + orte_restart_globals.snapshot_ref); return ORTE_ERROR; } @@ -442,61 +495,58 @@ static int parse_args(int argc, char *argv[]) * need to be grouped together. 
*/ if(argc > 1) { - orte_restart_globals.filename = strdup(opal_argv_join(argv, ' ')); + orte_restart_globals.snapshot_ref = opal_argv_join(argv, ' '); } return ORTE_SUCCESS; } -static int check_file(orte_snapc_base_global_snapshot_t *snapshot) +static int create_appfile(orte_sstore_base_global_snapshot_info_t *snapshot) { - int ret, exit_status = ORTE_SUCCESS; - - opal_output_verbose(10, orte_restart_globals.output, - "Checking for the existence of (%s)\n", - snapshot->local_location); - - if (0 > (ret = access(snapshot->local_location, F_OK)) ) { - exit_status = ORTE_ERROR; - goto cleanup; - } - - cleanup: - return exit_status; -} - -static int create_appfile(orte_snapc_base_global_snapshot_t *snapshot) -{ - int ret, exit_status = ORTE_SUCCESS; + int exit_status = ORTE_SUCCESS; FILE *appfile = NULL; opal_list_item_t* item = NULL; - - /* - * Extract the record information for the specified seq number. - * Note: If the seq # passed is -1, then the largest seq # is selected, - * ow the seq # requested is selected if available - */ - snapshot->seq_num = orte_restart_globals.seq_number; - if( ORTE_SUCCESS != (ret = orte_snapc_base_extract_metadata( snapshot ) ) ) { - opal_show_help("help-orte-restart.txt", "invalid_seq_num", true, - orte_restart_globals.filename, - (int)orte_restart_globals.seq_number); - exit_status = ret; - goto cleanup; - } + char *tmp_str = NULL; + char *amca_param = NULL; + char *reference_fmt_str = NULL; + char *location_str = NULL; + char *ref_location_fmt_str = NULL; + orte_sstore_base_local_snapshot_info_t *vpid_snapshot = NULL; /* * Create the appfile */ + orte_sstore.get_attr(snapshot->ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_LOC_ABS, + &tmp_str); asprintf(&orte_restart_globals.appfile, "%s/%s", - snapshot->local_location, + tmp_str, strdup("restart-appfile")); + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + orte_sstore.get_attr(snapshot->ss_handle, + SSTORE_METADATA_GLOBAL_AMCA_PARAM, + &amca_param); if (NULL
== (appfile = fopen(orte_restart_globals.appfile, "w")) ) { - exit_status = ret; + exit_status = ORTE_ERROR; goto cleanup; } + /* This will give a format string that we can use */ + orte_sstore.get_attr(snapshot->ss_handle, + SSTORE_METADATA_LOCAL_SNAP_REF_FMT, + &reference_fmt_str); + orte_sstore.get_attr(snapshot->ss_handle, + SSTORE_METADATA_LOCAL_SNAP_LOC, + &location_str); + orte_sstore.get_attr(snapshot->ss_handle, + SSTORE_METADATA_LOCAL_SNAP_REF_LOC_FMT, + &ref_location_fmt_str); + /* * Sort the snapshots so that they are in order */ @@ -508,8 +558,7 @@ static int create_appfile(orte_snapc_base_global_snapshot_t *snapshot) for(item = opal_list_get_first(&snapshot->local_snapshots); item != opal_list_get_end(&snapshot->local_snapshots); item = opal_list_get_next(item) ) { - orte_snapc_base_local_snapshot_t *vpid_snapshot; - vpid_snapshot = (orte_snapc_base_local_snapshot_t*)item; + vpid_snapshot = (orte_sstore_base_local_snapshot_info_t*)item; fprintf(appfile, "#\n"); fprintf(appfile, "# Old Process Name: %u.%u\n", @@ -517,48 +566,78 @@ static int create_appfile(orte_snapc_base_global_snapshot_t *snapshot) vpid_snapshot->process_name.vpid); fprintf(appfile, "#\n"); fprintf(appfile, "-np 1 "); - if(orte_restart_globals.preload) { - fprintf(appfile, "--preload-files %s/%s ", - vpid_snapshot->local_location, - vpid_snapshot->reference_name); - fprintf(appfile, "--preload-files-dest-dir . "); + + fprintf(appfile, "--sstore-load "); + /* loc:ref:local_ref:compress_comp:compress_postfix:seq */ + fprintf(appfile, "%s:%s:", + location_str, + orte_restart_globals.snapshot_ref); + fprintf(appfile, reference_fmt_str, vpid_snapshot->process_name.vpid); + fprintf(appfile, ":%s:%s:%d ", + (vpid_snapshot->compress_comp == NULL ? "" : vpid_snapshot->compress_comp), + (vpid_snapshot->compress_postfix == NULL ?
"" : vpid_snapshot->compress_postfix), + orte_restart_globals.seq_number); + + if( NULL == amca_param ) { + amca_param = strdup("ft-enable-cr"); + opal_show_help("help-orte-restart.txt", "amca_param_not_found", true, + amca_param); } - /* JJH: Make this match what the user originally specified on the command line */ - fprintf(appfile, "-am ft-enable-cr "); + fprintf(appfile, "-am %s ", amca_param); fprintf(appfile, " opal-restart "); - /* JJH: Make sure this changes if ever the default location of the local file is changed, - * currently it is safe to assume that it is in the current working directory. - * - * JJH: If we allow inplace restarting then this may be another directory... */ - if(orte_restart_globals.preload) { - /* If we preloaded the files then they are in the current working - * directory. */ - fprintf(appfile, "-mca crs_base_snapshot_dir . "); - } - else { - /* If we are *not* preloading the files, the point to the original checkpoint - * directory to access the checkpoint files. */ - fprintf(appfile, "-mca crs_base_snapshot_dir %s ", vpid_snapshot->local_location); - } - fprintf(appfile, "%s\n", vpid_snapshot->reference_name); + /* + * By default, point to the central storage location of the checkpoint. + * The active SStore module at restart time will determine if files + * need to be preloaded. 
+ */ + fprintf(appfile, "-l %s", location_str); + fprintf(appfile, " -m %s ", orte_sstore_base_local_metadata_filename); + + fprintf(appfile, "-r "); + fprintf(appfile, reference_fmt_str, vpid_snapshot->process_name.vpid); + + fprintf(appfile, "\n"); } cleanup: - if(NULL != appfile) + if(NULL != appfile) { fclose(appfile); - + appfile = NULL; + } + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + if( NULL != location_str ) { + free(location_str); + location_str = NULL; + } + if( NULL != reference_fmt_str ) { + free(reference_fmt_str); + reference_fmt_str = NULL; + } + if( NULL != ref_location_fmt_str ) { + free(ref_location_fmt_str); + ref_location_fmt_str = NULL; + } + return exit_status; } -static int spawn_children(orte_snapc_base_global_snapshot_t *snapshot, pid_t *child_pid) +static int spawn_children(orte_sstore_base_global_snapshot_info_t *snapshot, pid_t *child_pid) { int ret, exit_status = ORTE_SUCCESS; + char *amca_param = NULL; char **argv = NULL; - int argc = 0; + int argc = 0, i; int status; + orte_sstore.get_attr(snapshot->ss_handle, + SSTORE_METADATA_GLOBAL_AMCA_PARAM, + &amca_param); + if( ORTE_SUCCESS != (ret = opal_argv_append(&argc, &argv, "mpirun")) ) { exit_status = ret; goto cleanup; @@ -567,7 +646,12 @@ static int spawn_children(orte_snapc_base_global_snapshot_t *snapshot, pid_t *ch exit_status = ret; goto cleanup; } - if( ORTE_SUCCESS != (ret = opal_argv_append(&argc, &argv, "ft-enable-cr")) ) { + if( NULL == amca_param ) { + amca_param = strdup("ft-enable-cr"); + opal_show_help("help-orte-restart.txt", "amca_param_not_found", true, + amca_param); + } + if( ORTE_SUCCESS != (ret = opal_argv_append(&argc, &argv, amca_param)) ) { exit_status = ret; goto cleanup; } @@ -581,6 +665,20 @@ static int spawn_children(orte_snapc_base_global_snapshot_t *snapshot, pid_t *ch goto cleanup; } } + if( orte_restart_globals.mpirun_opts ) { + if( ORTE_SUCCESS != (ret = opal_argv_append(&argc, &argv, orte_restart_globals.mpirun_opts)) ) { + 
exit_status = ret; + goto cleanup; + } + } +#if OPAL_ENABLE_CRDEBUG == 1 + if( orte_restart_globals.enable_crdebug ) { + if( ORTE_SUCCESS != (ret = opal_argv_append(&argc, &argv, "--crdebug")) ) { + exit_status = ret; + goto cleanup; + } + } +#endif if( ORTE_SUCCESS != (ret = opal_argv_append(&argc, &argv, "--app")) ) { exit_status = ret; goto cleanup; @@ -590,6 +688,15 @@ static int spawn_children(orte_snapc_base_global_snapshot_t *snapshot, pid_t *ch goto cleanup; } + if( orte_restart_globals.showme ) { + for(i = 0; i < argc; ++i ) { + /*printf("%2d: (%s)\n", i, argv[i]);*/ + printf("%s ", argv[i]); + } + printf("\n"); + return ORTE_SUCCESS; + } + /* To fork off a child */ if( orte_restart_globals.forked ) { *child_pid = fork(); @@ -637,105 +744,86 @@ static int spawn_children(orte_snapc_base_global_snapshot_t *snapshot, pid_t *ch return exit_status; } -int snapshot_info(orte_snapc_base_global_snapshot_t *snapshot) +int snapshot_info(orte_sstore_base_global_snapshot_info_t *snapshot) { int ret, exit_status = ORTE_SUCCESS; int num_seqs, processes, i; - int *snapshot_ref_seqs; + char **snapshot_ref_seqs = NULL; opal_list_item_t* item = NULL; - orte_snapc_base_local_snapshot_t *vpid_snapshot; - - if (orte_restart_globals.seq_number == -1) { - if( ORTE_SUCCESS != (ret = orte_snapc_base_get_all_snapshot_ref_seqs(NULL, orte_restart_globals.filename, &num_seqs, &snapshot_ref_seqs) ) ) { - exit_status = ret; - goto cleanup; - } + orte_sstore_base_local_snapshot_info_t *vpid_snapshot = NULL; + char *tmp_str = NULL; + + /* + * Find all sequence numbers + */ + orte_sstore.get_attr(snapshot->ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_NUM_SEQ, + &tmp_str); + num_seqs = atoi(tmp_str); + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + orte_sstore.get_attr(snapshot->ss_handle, + SSTORE_METADATA_GLOBAL_SNAP_ALL_SEQ, + &tmp_str); + snapshot_ref_seqs = opal_argv_split(tmp_str, ','); + if( NULL != tmp_str ) { + free(tmp_str); + tmp_str = NULL; + } + + if( 0 > 
orte_restart_globals.seq_number ) { opal_output(orte_restart_globals.output, "Sequences: %d\n", num_seqs); - } else { - num_seqs = 1; - snapshot_ref_seqs = &orte_restart_globals.seq_number; } - for (i=0; i<num_seqs; i++) { - snapshot->seq_num = snapshot_ref_seqs[i]; + for(i=0; i < num_seqs; ++i) { + snapshot->seq_num = atoi(snapshot_ref_seqs[i]); - while (NULL != (item = opal_list_remove_first(&snapshot->local_snapshots))) { - OBJ_RELEASE(item); + if( 0 <= orte_restart_globals.seq_number && + snapshot->seq_num != orte_restart_globals.seq_number ) { + continue; } - if( NULL != snapshot->start_time ) { - free( snapshot->start_time ); - snapshot->start_time = NULL; - } - - if( NULL != snapshot->end_time ) { - free( snapshot->end_time ); - snapshot->end_time = NULL; + if( ORTE_SUCCESS != (ret = orte_sstore_base_extract_global_metadata( snapshot ) ) ) { + exit_status = ret; + goto cleanup; } opal_output(orte_restart_globals.output, "Seq: %d\n", snapshot->seq_num); - if( ORTE_SUCCESS != (ret = orte_snapc_base_extract_metadata( snapshot ) ) ) { - exit_status = ret; - goto cleanup; - } - - item = opal_list_get_first(&snapshot->local_snapshots); - vpid_snapshot = (orte_snapc_base_local_snapshot_t*)item; - if (NULL != snapshot->start_time ) { opal_output(orte_restart_globals.output, - "Begin Timestamp: %s\n", + "\tBegin Timestamp: %s\n", snapshot->start_time); } - - if (NULL != vpid_snapshot->opal_crs ) { - opal_output(orte_restart_globals.output, - "OPAL CRS Component: %s\n", - vpid_snapshot->opal_crs); - } - - if (NULL != snapshot->reference_name) { - opal_output(orte_restart_globals.output, - "Snapshot Reference: %s\n", - snapshot->reference_name); - } - - if (NULL != snapshot->local_location) { - opal_output(orte_restart_globals.output, - "Snapshot Location: %s\n", - snapshot->local_location); - } - if (NULL != snapshot->end_time ) { opal_output(orte_restart_globals.output, - "End Timestamp: %s\n", + "\tEnd Timestamp : %s\n", snapshot->end_time); } - processes = 0; - for(item = 
opal_list_get_first(&snapshot->local_snapshots); - item != opal_list_get_end(&snapshot->local_snapshots); - item = opal_list_get_next(item) ) { - processes++; - } + processes = opal_list_get_size(&snapshot->local_snapshots); opal_output(orte_restart_globals.output, - "Processes: %d\n", + "\tProcesses: %d\n", processes); for(item = opal_list_get_first(&snapshot->local_snapshots); item != opal_list_get_end(&snapshot->local_snapshots); item = opal_list_get_next(item) ) { - vpid_snapshot = (orte_snapc_base_local_snapshot_t*)item; + vpid_snapshot = (orte_sstore_base_local_snapshot_info_t*)item; opal_output_verbose(10, orte_restart_globals.output, - "Process: %u.%u", + "\t\tProcess: %u.%u \t CRS: %s \t Compress: %s (%s)", vpid_snapshot->process_name.jobid, - vpid_snapshot->process_name.vpid); + vpid_snapshot->process_name.vpid, + vpid_snapshot->crs_comp, + vpid_snapshot->compress_comp, + vpid_snapshot->compress_postfix); } } @@ -746,10 +834,10 @@ int snapshot_info(orte_snapc_base_global_snapshot_t *snapshot) static int snapshot_sort_compare_fn(opal_list_item_t **a, opal_list_item_t **b) { - orte_snapc_base_local_snapshot_t *snap_a, *snap_b; + orte_sstore_base_local_snapshot_info_t *snap_a, *snap_b; - snap_a = (orte_snapc_base_local_snapshot_t*)(*a); - snap_b = (orte_snapc_base_local_snapshot_t*)(*b); + snap_a = (orte_sstore_base_local_snapshot_info_t*)(*a); + snap_b = (orte_sstore_base_local_snapshot_info_t*)(*b); if( snap_a->process_name.vpid > snap_b->process_name.vpid ) { return 1; diff --git a/orte/tools/orterun/orterun.c b/orte/tools/orterun/orterun.c index 88d5abbaab..38b8cb384a 100644 --- a/orte/tools/orterun/orterun.c +++ b/orte/tools/orterun/orterun.c @@ -194,6 +194,13 @@ static opal_cmd_line_init_t cmd_line_init[] = { &orterun_globals.preload_files_dest_dir, OPAL_CMD_LINE_TYPE_STRING, "The destination directory to use in conjunction with --preload-files. By default the absolute and relative paths provided by --preload-files are used." 
}, +#if OPAL_ENABLE_FT_CR == 1 + /* Tell SStore to preload a snapshot before launch */ + { NULL, NULL, NULL, '\0', NULL, "sstore-load", 1, + &orterun_globals.sstore_load, OPAL_CMD_LINE_TYPE_STRING, + "Internal Use Only! Tell SStore to preload a snapshot before launch." }, +#endif + /* Use an appfile */ { NULL, NULL, NULL, '\0', NULL, "app", 1, &orterun_globals.appfile, OPAL_CMD_LINE_TYPE_STRING, @@ -426,6 +433,12 @@ static opal_cmd_line_init_t cmd_line_init[] = { NULL, OPAL_CMD_LINE_TYPE_INT, "Max number of times to locally restart a failed process before relocating it to a new node" }, +#if OPAL_ENABLE_CRDEBUG == 1 + { "opal", "cr", "enable_crdebug", '\0', "crdebug", "crdebug", 0, + NULL, OPAL_CMD_LINE_TYPE_BOOL, + "Enable C/R Debugging" }, +#endif + { NULL, NULL, NULL, '\0', "disable-recovery", "disable-recovery", 0, &orterun_globals.disable_recovery, OPAL_CMD_LINE_TYPE_BOOL, "Disable recovery (resets all recovery options to off)" }, @@ -810,6 +823,10 @@ static int init_globals(void) orterun_globals.preload_files = NULL; orterun_globals.preload_files_dest_dir = NULL; +#if OPAL_ENABLE_FT_CR == 1 + orterun_globals.sstore_load = NULL; +#endif + /* All done */ globals_init = true; return ORTE_SUCCESS; @@ -1580,6 +1597,13 @@ static int create_app(int argc, char* argv[], orte_app_context_t **app_ptr, else app->preload_files_dest_dir = NULL; +#if OPAL_ENABLE_FT_CR == 1 + if( NULL != orterun_globals.sstore_load ) { + app->sstore_load = strdup(orterun_globals.sstore_load); + } else { + app->sstore_load = NULL; + } +#endif /* Do not try to find argv[0] here -- the starter is responsible for that because it may not be relevant to try to find it on diff --git a/orte/tools/orterun/orterun.h b/orte/tools/orterun/orterun.h index 1ab5112caa..f783100cbc 100644 --- a/orte/tools/orterun/orterun.h +++ b/orte/tools/orterun/orterun.h @@ -1,5 +1,5 @@ /* - * Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana + * Copyright (c) 2004-2010 The Trustees of Indiana 
University and Indiana * University Research and Technology * Corporation. All rights reserved. * Copyright (c) 2004-2005 The University of Tennessee and The University @@ -63,6 +63,9 @@ struct orterun_globals_t { bool wait_for_server; int server_wait_timeout; char *stdin_target; +#if OPAL_ENABLE_FT_CR == 1 + char *sstore_load; +#endif bool disable_recovery; };