A number of C/R enhancements per RFC below:
  http://www.open-mpi.org/community/lists/devel/2010/07/8240.php

Documentation:
  http://osl.iu.edu/research/ft/

Major Changes:
--------------
 * Added C/R-enabled Debugging support.
   Enabled with the --enable-crdebug flag. See the following website for more information:
   http://osl.iu.edu/research/ft/crdebug/
 * Added Stable Storage (SStore) framework for checkpoint storage
   * 'central' component does a direct to central storage save
   * 'stage' component stages checkpoints to central storage while the application continues execution.
     * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
     * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
 * Added Compression (compress) framework to support
 * Add two new ErrMgr recovery policies
   * {{{crmig}}} C/R Process Migration
   * {{{autor}}} C/R Automatic Recovery
 * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
 * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
   * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
   * {{{OMPI_CR_Restart}}}
   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
   * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
   * {{{OMPI_CR_Quiesce_start}}}
   * {{{OMPI_CR_Quiesce_checkpoint}}}
   * {{{OMPI_CR_Quiesce_end}}}
   * {{{OMPI_CR_self_register_checkpoint_callback}}}
   * {{{OMPI_CR_self_register_restart_callback}}}
   * {{{OMPI_CR_self_register_continue_callback}}}
 * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
 * Add a progress meter to:
   * FileM rsh (filem_rsh_process_meter)
   * SnapC full (snapc_full_progress_meter)
   * SStore stage (sstore_stage_progress_meter)
 * Added 2 new command line options to ompi-restart
   * --showme : Display the full command line that would have been exec'ed.
   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
 * Deprecated some MCA params:
   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
   * snapc_base_store_in_place deprecated, replaced with different components of SStore
   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
   * snapc_base_establish_global_snapshot_dir deprecated, never well supported
   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem

Minor Changes:
--------------
 * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
 * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
 * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
 * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
 * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
 * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
 * Clean up the CRS framework and components to work with the SStore framework.
 * Clean up the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
 * Add 'quiesce' hook to CRCP for a future enhancement.
 * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
 * Add optional application level INC callbacks (registered through the CR MPI Ext interface).
 * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
 * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
 * {{{opal-restart}}} also supports local decompression before restarting
 * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
 * {{{orte-restart}}} now uses the SStore framework to work with the metadata
 * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
 * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
 * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
 * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
 * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
 * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
 * Improve the checks for 'already checkpointing' error path.
 * Add a recovery output timer, to show how long it takes to restart a job
 * Do a better job of cleaning up the old session directory on restart.
 * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (which does not work for MPI apps at the moment)
 * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.

This commit was SVN r23587.

The following Trac tickets were found above:
  Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
  Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
  Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
  Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
  Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
  Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
  Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
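
The deprecated MCA parameters listed in the commit message amount to a rename table. A minimal sketch in C of such a lookup, assuming a hypothetical helper `sketch_mca_param_replacement()` that is not part of Open MPI; only the parameters for which the message names a direct replacement are included:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rename table copied from the "Deprecated some MCA params" list.
 * Entries without a one-to-one replacement are deliberately left out. */
struct sketch_param_rename {
    const char *old_name;
    const char *new_name;
};

static const struct sketch_param_rename sketch_renames[] = {
    { "crs_base_snapshot_dir",          "sstore_stage_local_snapshot_dir" },
    { "snapc_base_global_snapshot_dir", "sstore_base_global_snapshot_dir" },
    { "snapc_base_global_shared",       "sstore_stage_global_is_shared"   },
    { "snapc_base_global_snapshot_ref", "sstore_base_global_snapshot_ref" },
    { "snapc_full_skip_filem",          "sstore_stage_skip_filem"         },
};

/* Hypothetical helper: returns the replacement parameter name, or NULL
 * when the given name has no direct replacement in the table. */
static const char *sketch_mca_param_replacement(const char *name)
{
    size_t i;

    assert(NULL != name);
    for (i = 0; i < sizeof(sketch_renames) / sizeof(sketch_renames[0]); i++) {
        if (0 == strcmp(name, sketch_renames[i].old_name)) {
            return sketch_renames[i].new_name;
        }
    }
    return NULL;
}
```

A caller would treat a NULL result as "no direct replacement", which covers both unknown names and the two parameters the message deprecates without naming a successor.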
Parent: 9fff01704f
Commit: e12ca48cd9
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
+# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 # University Research and Technology
 # Corporation. All rights reserved.
 # Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -22,7 +22,9 @@ amca_paramdir = $(AMCA_PARAM_SETS_DIR)
 dist_amca_param_DATA = amca-param-sets/example.conf
 
 if WANT_FT
-dist_amca_param_DATA += amca-param-sets/ft-enable-cr
+dist_amca_param_DATA += \
+        amca-param-sets/ft-enable-cr \
+        amca-param-sets/ft-enable-cr-recovery
 endif
 
 EXTRA_DIST = \
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2008-2009 The Trustees of Indiana University and Indiana
+# Copyright (c) 2008-2010 The Trustees of Indiana University and Indiana
 # University Research and Technology
 # Corporation. All rights reserved.
 #
@@ -37,7 +37,6 @@ opal_cr_use_thread=1
 #
 rml_wrapper=ftrm
 snapc=full
-#filem=rsh
 
 #
 # OMPI Parameters
contrib/amca-param-sets/ft-enable-cr-recovery (new regular file, 82 lines added)
@@ -0,0 +1,82 @@
+#
+# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
+# University Research and Technology
+# Corporation. All rights reserved.
+# $COPYRIGHT$
+#
+# Additional copyrights may follow
+#
+# $HEADER$
+#
+# An Aggregate MCA Parameter Set to enable checkpoint/restart capabilities
+# for a job.
+#
+# Usage:
+# shell$ mpirun -am ft-enable-cr ./app
+#
+
+#
+# OPAL Parameters
+# - Turn off OPAL only checkpointing
+# - Select only checkpoint ready components
+# - Enable Additional FT infrastructure
+# - Auto-select OPAL CRS component
+# - If available, use the FT Thread (Default)
+#
+opal_cr_allow_opal_only=0
+mca_base_component_distill_checkpoint_ready=1
+ft_cr_enabled=1
+crs=
+opal_cr_use_thread=1
+
+#
+# ORTE Parameters
+# - Wrap the RML
+# - Use the 'full' Snapshot Coordinator
+# - Use the 'cm' routed component. It is the only one that is currently able to
+# handle process and daemon loss.
+#
+rml_wrapper=ftrm
+snapc=full
+routed=cm
+
+#
+# OMPI Parameters
+# - Wrap the PML
+# - Use a Bookmark Exchange Fully Coordinated Checkpoint/Restart Coordination Protocol
+#
+pml_wrapper=crcpw
+crcp=bkmrk
+
+#
+# Temporary fix to force the event engine to use poll to behave well with BLCR
+#
+opal_event_include=poll
+
+#
+# We currently only support the following options to the OpenIB BTL
+# Future development will attempt to eliminate many of these restrictions
+#
+btl_openib_want_fork_support=1
+btl_openib_use_async_event_thread=0
+btl_openib_use_eager_rdma=0
+btl_openib_cpc_include=oob
+
+# Enable SIGTSTP/SIGCONT capability
+# killall -TSTP mpirun
+# killall -CONT mpirun
+orte_forward_job_control=1
+
+#
+# Use the C/R Error Management and Recovery Service
+#
+orte_enable_recovery=1
+orte_max_global_restarts=10
+errmgr_crmig_enable=1
+errmgr_autor_enable=1
+
+#
+# Additional constraints to be lifted in the future
+#
+plm=rsh
+rmaps=resilient
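
The parameter set above is a plain list of `name=value` lines with `#` comments, and a value may be empty (as in `crs=`). A minimal sketch of splitting one such line, assuming a hypothetical helper `sketch_parse_amca_line()`; this is an illustration only and not how Open MPI itself reads these files:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not Open MPI's parser.
 * Splits one line of a parameter-set file into key and value.
 * Returns 1 when a "key=value" pair was found, and 0 for blank lines,
 * comment lines, lines without '=', or pairs that do not fit the buffers.
 * The value may be empty, as in "crs=". */
static int sketch_parse_amca_line(const char *line,
                                  char *key, size_t key_len,
                                  char *val, size_t val_len)
{
    const char *eq;
    size_t klen, vlen;

    assert(NULL != line && NULL != key && NULL != val);

    /* Skip leading blanks */
    while (' ' == *line || '\t' == *line) {
        line++;
    }
    if ('\0' == *line || '#' == *line || '\n' == *line || '\r' == *line) {
        return 0;
    }

    eq = strchr(line, '=');
    if (NULL == eq || eq == line) {
        return 0;
    }

    klen = (size_t)(eq - line);
    vlen = strcspn(eq + 1, "\r\n");   /* value ends at the line break */
    if (klen >= key_len || vlen >= val_len) {
        return 0;
    }

    memcpy(key, line, klen);
    key[klen] = '\0';
    memcpy(val, eq + 1, vlen);
    val[vlen] = '\0';
    return 1;
}
```

Feeding it `routed=cm` yields the pair (`routed`, `cm`), while `crs=` yields an empty value, which the file's own comment describes as auto-selecting the CRS component.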
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
+ * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
  * University Research and Technology
  * Corporation. All rights reserved.
  * Copyright (c) 2004-2007 The University of Tennessee and The University
@@ -54,7 +54,7 @@ int mca_bml_r2_ft_event(int state)
     first_continue_pass = !first_continue_pass;
 
     /* Since nothing in Checkpoint, we are fine here (unless required by BTL) */
-    if( ompi_cr_continue_like_restart && !first_continue_pass) {
+    if( orte_cr_continue_like_restart && !first_continue_pass) {
         procs = ompi_proc_all(&num_procs);
         if(NULL == procs) {
             return OMPI_ERR_OUT_OF_RESOURCE;
@@ -136,7 +136,7 @@ int mca_bml_r2_ft_event(int state)
     }
     else if(OPAL_CRS_CONTINUE == state) {
         /* Matches OPAL_CRS_RESTART_PRE */
-        if( ompi_cr_continue_like_restart && first_continue_pass) {
+        if( orte_cr_continue_like_restart && first_continue_pass) {
             if( OMPI_SUCCESS != (ret = mca_bml_r2_finalize()) ) {
                 opal_output(0, "bml:r2: ft_event(Restart): Failed to finalize BML framework\n");
                 return ret;
@@ -147,7 +147,7 @@ int mca_bml_r2_ft_event(int state)
             }
         }
         /* Matches OPAL_CRS_RESTART */
-        else if( ompi_cr_continue_like_restart && !first_continue_pass ) {
+        else if( orte_cr_continue_like_restart && !first_continue_pass ) {
             /*
              * Barrier to make all processes have been successfully restarted before
              * we try to remove some restart only files.
@@ -157,10 +157,6 @@ int mca_bml_r2_ft_event(int state)
                 return ret;
             }
 
-            opal_output_verbose(10, ompi_cr_output,
-                                "bml:r2: ft_event(Restart): Cleanup restart files\n");
-            opal_crs_base_cleanup_flush();
-
             /*
              * Re-open the BTL framework to get the full list of components.
              */
@@ -234,10 +230,6 @@ int mca_bml_r2_ft_event(int state)
             return ret;
         }
 
-        opal_output_verbose(10, ompi_cr_output,
-                            "bml:r2: ft_event(Restart): Cleanup restart files\n");
-        opal_crs_base_cleanup_flush();
-
         /*
          * Re-open the BTL framework to get the full list of components.
          * - but first clear the MCA value that was there
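
The conditions renamed in the hunks above pair a "continue like restart" flag with a pass toggle: the first CONTINUE pass is handled like `OPAL_CRS_RESTART_PRE` and the second like `OPAL_CRS_RESTART`. A minimal sketch of that two-pass pattern, assuming the toggle starts out false and is flipped once per CONTINUE event; `sketch_continue_step()` and its enum are hypothetical names, not Open MPI code:

```c
#include <assert.h>
#include <stdbool.h>

/* What a CONTINUE event is treated as, following the comments in the diff. */
enum sketch_continue_action {
    SKETCH_PLAIN_CONTINUE,      /* flag not set: nothing to rebuild      */
    SKETCH_LIKE_RESTART_PRE,    /* first pass: tear down (finalize)      */
    SKETCH_LIKE_RESTART         /* second pass: rebuild (re-open)        */
};

/* Assumption: starts false, so the first CONTINUE flips it to true. */
static bool sketch_first_continue_pass = false;

/* Flip the pass toggle, then classify this CONTINUE event. */
static enum sketch_continue_action
sketch_continue_step(bool continue_like_restart)
{
    sketch_first_continue_pass = !sketch_first_continue_pass;

    if (!continue_like_restart) {
        return SKETCH_PLAIN_CONTINUE;
    }
    return sketch_first_continue_pass ? SKETCH_LIKE_RESTART_PRE
                                      : SKETCH_LIKE_RESTART;
}
```

Because the toggle is static state, two consecutive CONTINUE events with the flag set alternate between the tear-down and rebuild roles.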
@@ -641,7 +641,7 @@ int mca_btl_mx_ft_event(int state) {
          * kernel: blcr: thaw_threads returned error, aborting. -1
          * JJH: It may be possible to, instead of restarting the entire driver, just reconnect endpoints
          */
-        ompi_cr_continue_like_restart = true;
+        orte_cr_continue_like_restart = true;
 
         for( i = 0; i < mca_btl_mx_component.mx_num_btls; i++ ) {
             mx_btl = mca_btl_mx_component.mx_btls[i];
@@ -1735,7 +1735,7 @@ int mca_btl_openib_ft_event(int state) {
     if(OPAL_CRS_CHECKPOINT == state) {
         /* Continue must reconstruct the routes (including modex), since we
          * have to tear down the devices completely. */
-        ompi_cr_continue_like_restart = true;
+        orte_cr_continue_like_restart = true;
 
         /*
          * To keep the node from crashing we need to call ibv_close_device
@@ -52,6 +52,7 @@
 #if OPAL_ENABLE_FT_CR == 1
 #include "opal/mca/crs/base/base.h"
 #include "opal/util/basename.h"
+#include "orte/mca/sstore/sstore.h"
 #include "ompi/runtime/ompi_cr.h"
 #endif
 
@@ -1099,8 +1100,6 @@ int mca_btl_sm_ft_event(int state) {
 }
 #else
 int mca_btl_sm_ft_event(int state) {
-    char * tmp_dir = NULL;
-
     /* Notify mpool */
     if( NULL != mca_btl_sm_component.sm_mpool &&
         NULL != mca_btl_sm_component.sm_mpool->mpool_ft_event) {
@@ -1114,17 +1113,14 @@ int mca_btl_sm_ft_event(int state) {
              * for these old file handles. The restart procedure will make sure
              * these files get cleaned up appropriately.
              */
-            opal_crs_base_metadata_write_token(NULL, CRS_METADATA_TOUCH, mca_btl_sm_component.sm_seg->module_seg_path);
-
-            /* Record the job session directory */
-            opal_crs_base_metadata_write_token(NULL, CRS_METADATA_MKDIR, orte_process_info.job_session_dir);
+            orte_sstore.set_attr(orte_sstore_handle_current,
+                                 SSTORE_METADATA_LOCAL_TOUCH,
+                                 mca_btl_sm_component.sm_seg->module_seg_path);
         }
     }
     else if(OPAL_CRS_CONTINUE == state) {
-        if( ompi_cr_continue_like_restart ) {
+        if( orte_cr_continue_like_restart ) {
             if( NULL != mca_btl_sm_component.sm_seg ) {
-                /* Do not Add session directory on continue */
-
                 /* Add shared memory file */
                 opal_crs_base_cleanup_append(mca_btl_sm_component.sm_seg->module_seg_path, false);
             }
@@ -1136,14 +1132,6 @@ int mca_btl_sm_ft_event(int state) {
     else if(OPAL_CRS_RESTART == state ||
             OPAL_CRS_RESTART_PRE == state) {
         if( NULL != mca_btl_sm_component.sm_seg ) {
-            /* Add session directory */
-            opal_crs_base_cleanup_append(orte_process_info.job_session_dir, true);
-            tmp_dir = opal_dirname(orte_process_info.job_session_dir);
-            if( NULL != tmp_dir ) {
-                opal_crs_base_cleanup_append(tmp_dir, true);
-                free(tmp_dir);
-                tmp_dir = NULL;
-            }
             /* Add shared memory file */
             opal_crs_base_cleanup_append(mca_btl_sm_component.sm_seg->module_seg_path, false);
         }
@@ -1,5 +1,5 @@
 #
-# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
+# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 # University Research and Technology
 # Corporation. All rights reserved.
 # Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -26,3 +26,4 @@ libmca_crcp_la_SOURCES += \
         base/crcp_base_close.c \
         base/crcp_base_select.c \
         base/crcp_base_fns.c
+
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
+ * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
  * University Research and Technology
  * Corporation. All rights reserved.
  * Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -60,6 +60,12 @@ BEGIN_C_DECLS
  */
 OMPI_DECLSPEC int ompi_crcp_base_close(void);
 
+/**
+ * Quiesce Interface (For MPI Ext.)
+ */
+OMPI_DECLSPEC int ompi_crcp_base_quiesce_start(MPI_Info *info);
+OMPI_DECLSPEC int ompi_crcp_base_quiesce_end(MPI_Info *info);
+
 /**
  * 'None' component functions
  * These are to be used when no component is selected.
@@ -72,6 +78,10 @@ BEGIN_C_DECLS
 int ompi_crcp_base_module_init(void);
 int ompi_crcp_base_module_finalize(void);
 
+/* Quiesce Interface */
+int ompi_crcp_base_none_quiesce_start(MPI_Info *info);
+int ompi_crcp_base_none_quiesce_end(MPI_Info *info);
+
 /* PML Interface */
 ompi_crcp_base_pml_state_t* ompi_crcp_base_none_pml_enable( bool enable, ompi_crcp_base_pml_state_t* );
 
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  * All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  * All rights reserved.
@@ -38,6 +38,7 @@
 #include "ompi/mca/crcp/crcp.h"
 #include "ompi/mca/crcp/base/base.h"
 #include "ompi/mca/bml/base/base.h"
+#include "ompi/info/info.h"
 #include "ompi/mca/pml/pml.h"
 #include "ompi/mca/pml/base/base.h"
 #include "ompi/mca/pml/base/pml_base_request.h"
@@ -92,6 +93,19 @@ int ompi_crcp_base_module_finalize(void)
     return OMPI_SUCCESS;
 }
 
+/****************
+ * MPI Quiesce Interface
+ ****************/
+int ompi_crcp_base_none_quiesce_start(MPI_Info *info)
+{
+    return OMPI_SUCCESS;
+}
+
+int ompi_crcp_base_none_quiesce_end(MPI_Info *info)
+{
+    return OMPI_SUCCESS;
+}
+
 /****************
  * PML Wrapper
  ****************/
@@ -397,3 +411,24 @@ ompi_crcp_base_none_btl_ft_event(int state,
 /********************
  * Utility functions
  ********************/
+
+/******************
+ * MPI Interface Functions
+ ******************/
+int ompi_crcp_base_quiesce_start(MPI_Info *info)
+{
+    if( NULL != ompi_crcp.quiesce_start ) {
+        return ompi_crcp.quiesce_start(info);
+    } else {
+        return OMPI_SUCCESS;
+    }
+}
+
+int ompi_crcp_base_quiesce_end(MPI_Info *info)
+{
+    if( NULL != ompi_crcp.quiesce_end ) {
+        return ompi_crcp.quiesce_end(info);
+    } else {
+        return OMPI_SUCCESS;
+    }
+}
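
The base wrappers above forward to the selected component only when it provides the hook, and otherwise report success. A self-contained sketch of that NULL-checked dispatch, using stand-in types and error codes (all `sketch_*` / `SKETCH_*` names are hypothetical, and the module is passed as an argument here instead of living in a global such as `ompi_crcp`):

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SUCCESS              0
#define SKETCH_ERR_NOT_IMPLEMENTED (-1)

typedef struct sketch_info sketch_info_t;   /* opaque stand-in for MPI_Info */

typedef int (*sketch_quiesce_fn_t)(sketch_info_t *info);

/* Stand-in for the module struct: just the two quiesce hooks. */
struct sketch_crcp_module {
    sketch_quiesce_fn_t quiesce_start;
    sketch_quiesce_fn_t quiesce_end;
};

/* A component whose hook exists but is not implemented yet. */
static int sketch_unimplemented_quiesce(sketch_info_t *info)
{
    (void)info;
    return SKETCH_ERR_NOT_IMPLEMENTED;
}

/* Base wrapper: call the hook when the module sets one, else succeed. */
static int sketch_base_quiesce_start(const struct sketch_crcp_module *mod,
                                     sketch_info_t *info)
{
    assert(NULL != mod);
    if (NULL != mod->quiesce_start) {
        return mod->quiesce_start(info);
    }
    return SKETCH_SUCCESS;
}
```

The distinction matters to callers: a missing hook is a silent no-op, while a hook that returns a not-implemented code surfaces as an error.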
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  * All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  * All rights reserved.
@@ -63,6 +63,10 @@ static ompi_crcp_base_module_t none_module = {
     /** Finalization Function */
     ompi_crcp_base_module_finalize,
 
+    /** Quiesce interface */
+    ompi_crcp_base_none_quiesce_start,
+    ompi_crcp_base_none_quiesce_end,
+
     /** PML Wrapper */
     ompi_crcp_base_none_pml_enable,
 
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2008 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  * All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  * All rights reserved.
@@ -57,6 +57,12 @@ BEGIN_C_DECLS
 int ompi_crcp_bkmrk_pml_init(void);
 int ompi_crcp_bkmrk_pml_finalize(void);
 
+/*
+ * Quiesce Interface
+ */
+int ompi_crcp_bkmrk_quiesce_start(MPI_Info *info);
+int ompi_crcp_bkmrk_quiesce_end(MPI_Info *info);
+
 END_C_DECLS
 
 #endif /* MCA_CRCP_HOKE_EXPORT_H */
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2009 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  * All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  * All rights reserved.
@@ -44,6 +44,10 @@ static ompi_crcp_base_module_t loc_module = {
     /** Finalization Function */
     ompi_crcp_bkmrk_module_finalize,
 
+    /** Quiesce interface */
+    ompi_crcp_bkmrk_quiesce_start,
+    ompi_crcp_bkmrk_quiesce_end,
+
     /** PML Wrapper */
     NULL, /* ompi_crcp_bkmrk_pml_enable, */
 
@@ -131,6 +135,34 @@ int ompi_crcp_bkmrk_module_finalize(void)
     return OMPI_SUCCESS;
 }
 
+int ompi_crcp_bkmrk_quiesce_start(MPI_Info *info)
+{
+    OPAL_OUTPUT_VERBOSE((10, mca_crcp_bkmrk_component.super.output_handle,
+                         "crcp:bkmrk: quiesce_start(--)"));
+#if 0
+    if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_start(QUIESCE_TAG_CKPT)) ) {
+        ;
+    }
+    return OMPI_SUCCESS;
+#else
+    return OMPI_ERR_NOT_IMPLEMENTED;
+#endif
+}
+
+int ompi_crcp_bkmrk_quiesce_end(MPI_Info *info)
+{
+    OPAL_OUTPUT_VERBOSE((10, mca_crcp_bkmrk_component.super.output_handle,
+                         "crcp:bkmrk: quiesce_end(--)"));
+#if 0
+    if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_end(QUIESCE_TAG_CONTINUE) ) ) {
+        ;
+    }
+    return OMPI_SUCCESS;
+#else
+    return OMPI_ERR_NOT_IMPLEMENTED;
+#endif
+}
+
 /******************
  * Local functions
  ******************/
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2009 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  * All rights reserved.
  * Copyright (c) 2010 The University of Tennessee and The University
  * of Tennessee Research Foundation. All rights
@@ -2986,6 +2986,26 @@ int ompi_crcp_bkmrk_request_complete(struct ompi_request_t *request)
 }
 
 /**************** FT Event *****************/
+int ompi_crcp_bkmrk_pml_quiesce_start(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag ) {
+    int ret, exit_status = OMPI_SUCCESS;
+
+    if( OMPI_SUCCESS != (ret = ft_event_coordinate_peers()) ) {
+        exit_status = ret;
+    }
+
+    return exit_status;
+}
+
+int ompi_crcp_bkmrk_pml_quiesce_end(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag ) {
+    int ret, exit_status = OMPI_SUCCESS;
+
+    if( OMPI_SUCCESS != (ret = ft_event_finalize_exchange() ) ) {
+        exit_status = ret;
+    }
+
+    return exit_status;
+}
+
 ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event(
                                   int state,
                                   ompi_crcp_base_pml_state_t* pml_state)
@@ -3027,7 +3047,7 @@ ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event(
          * When we return from this function we know that all of our
          * channels have been flushed.
          */
-        if( OMPI_SUCCESS != (ret = ft_event_coordinate_peers()) ) {
+        if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_start(QUIESCE_TAG_CKPT)) ) {
             opal_output(mca_crcp_bkmrk_component.super.output_handle,
                         "crcp:bkmrk: %s ft_event: Checkpoint Coordination Failed %d",
                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
@@ -3060,7 +3080,7 @@ ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event(
         first_continue_pass = !first_continue_pass;
 
         /* Only finalize the Protocol after the PML has been rebuilt */
-        if( ompi_cr_continue_like_restart && first_continue_pass ) {
+        if( orte_cr_continue_like_restart && first_continue_pass ) {
             goto DONE;
         }
 
@@ -3069,7 +3089,7 @@ ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event(
         /*
          * Finish the coord protocol
          */
-        if( OMPI_SUCCESS != (ret = ft_event_finalize_exchange() ) ) {
+        if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_end(QUIESCE_TAG_CONTINUE) ) ) {
             opal_output(mca_crcp_bkmrk_component.super.output_handle,
                         "crcp:bkmrk: pml_ft_event: Checkpoint Finalization Failed %d",
                         ret);
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2004-2007 The Trustees of Indiana University.
+ * Copyright (c) 2004-2010 The Trustees of Indiana University.
  * All rights reserved.
  * Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
  * All rights reserved.
@@ -116,6 +116,18 @@ BEGIN_C_DECLS
|
|||||||
ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event
|
ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event
|
||||||
(int state, ompi_crcp_base_pml_state_t* pml_state);
|
(int state, ompi_crcp_base_pml_state_t* pml_state);
|
||||||
|
|
||||||
|
enum ompi_crcp_bkmrk_pml_quiesce_tag_type_t {
|
||||||
|
QUIESCE_TAG_NONE = 0, /* 0 No tag specified */
|
||||||
|
QUIESCE_TAG_CKPT, /* 1 Prepare for checkpoint */
|
||||||
|
QUIESCE_TAG_CONTINUE, /* 2 Continue after a checkpoint */
|
||||||
|
QUIESCE_TAG_RESTART, /* 3 Restart from a checkpoint */
|
||||||
|
QUIESCE_TAG_UNKNOWN /* 4 Unknown */
|
||||||
|
};
|
||||||
|
typedef enum ompi_crcp_bkmrk_pml_quiesce_tag_type_t ompi_crcp_bkmrk_pml_quiesce_tag_type_t;
|
||||||
|
|
||||||
|
int ompi_crcp_bkmrk_pml_quiesce_start(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag );
|
||||||
|
int ompi_crcp_bkmrk_pml_quiesce_end(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag );
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Request function
|
* Request function
|
||||||
*/
|
*/
|
||||||
|
@@ -61,6 +61,23 @@ typedef int (*ompi_crcp_base_module_init_fn_t)
|
|||||||
typedef int (*ompi_crcp_base_module_finalize_fn_t)
|
typedef int (*ompi_crcp_base_module_finalize_fn_t)
|
||||||
(void);
|
(void);
|
||||||
|
|
||||||
|
|
||||||
|
/************************
|
||||||
|
* MPI Quiesce Interface
|
||||||
|
************************/
|
||||||
|
/**
|
||||||
|
* MPI_Quiesce_start component interface
|
||||||
|
*/
|
||||||
|
typedef int (*ompi_crcp_base_quiesce_start_fn_t)
|
||||||
|
(MPI_Info *info);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* MPI_Quiesce_end component interface
|
||||||
|
*/
|
||||||
|
typedef int (*ompi_crcp_base_quiesce_end_fn_t)
|
||||||
|
(MPI_Info *info);
|
||||||
|
|
||||||
|
|
||||||
/************************
|
/************************
|
||||||
* PML Wrapper hooks
|
* PML Wrapper hooks
|
||||||
* PML Wrapper is the CRCPW PML component
|
* PML Wrapper is the CRCPW PML component
|
||||||
@@ -283,6 +300,10 @@ struct ompi_crcp_base_module_1_0_0_t {
|
|||||||
/** Finalization Function */
|
/** Finalization Function */
|
||||||
ompi_crcp_base_module_finalize_fn_t crcp_finalize;
|
ompi_crcp_base_module_finalize_fn_t crcp_finalize;
|
||||||
|
|
||||||
|
/**< MPI_Quiesce Interface Functions ******************/
|
||||||
|
ompi_crcp_base_quiesce_start_fn_t quiesce_start;
|
||||||
|
ompi_crcp_base_quiesce_end_fn_t quiesce_end;
|
||||||
|
|
||||||
/**< PML Wrapper Functions ****************************/
|
/**< PML Wrapper Functions ****************************/
|
||||||
ompi_crcp_base_pml_enable_fn_t pml_enable;
|
ompi_crcp_base_pml_enable_fn_t pml_enable;
|
||||||
|
|
||||||
|
@@ -32,6 +32,7 @@
|
|||||||
#include "orte/util/proc_info.h"
|
#include "orte/util/proc_info.h"
|
||||||
|
|
||||||
#if OPAL_ENABLE_FT_CR == 1
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
|
#include "orte/mca/sstore/sstore.h"
|
||||||
#include "ompi/mca/mpool/base/base.h"
|
#include "ompi/mca/mpool/base/base.h"
|
||||||
#include "ompi/runtime/ompi_cr.h"
|
#include "ompi/runtime/ompi_cr.h"
|
||||||
#endif
|
#endif
|
||||||
@@ -169,12 +170,12 @@ int mca_mpool_sm_ft_event(int state) {
|
|||||||
asprintf( &file_name, "%s"OPAL_PATH_SEP"shared_mem_pool.%s",
|
asprintf( &file_name, "%s"OPAL_PATH_SEP"shared_mem_pool.%s",
|
||||||
orte_process_info.job_session_dir,
|
orte_process_info.job_session_dir,
|
||||||
orte_process_info.nodename );
|
orte_process_info.nodename );
|
||||||
opal_crs_base_metadata_write_token(NULL, CRS_METADATA_TOUCH, file_name);
|
orte_sstore.set_attr(orte_sstore_handle_current, SSTORE_METADATA_LOCAL_TOUCH, file_name);
|
||||||
free(file_name);
|
free(file_name);
|
||||||
file_name = NULL;
|
file_name = NULL;
|
||||||
}
|
}
|
||||||
else if(OPAL_CRS_CONTINUE == state) {
|
else if(OPAL_CRS_CONTINUE == state) {
|
||||||
if(ompi_cr_continue_like_restart) {
|
if(orte_cr_continue_like_restart) {
|
||||||
/* Find the sm module */
|
/* Find the sm module */
|
||||||
self_module = mca_mpool_base_module_lookup("sm");
|
self_module = mca_mpool_base_module_lookup("sm");
|
||||||
self_sm_module = (mca_mpool_sm_module_t*) self_module;
|
self_sm_module = (mca_mpool_sm_module_t*) self_module;
|
||||||
|
@@ -691,7 +691,7 @@ int mca_pml_bfo_ft_event( int state )
|
|||||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2);
|
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2);
|
||||||
}
|
}
|
||||||
|
|
||||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||||
/*
|
/*
|
||||||
* Get a list of processes
|
* Get a list of processes
|
||||||
*/
|
*/
|
||||||
@@ -791,7 +791,7 @@ int mca_pml_bfo_ft_event( int state )
|
|||||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3);
|
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3);
|
||||||
}
|
}
|
||||||
|
|
||||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||||
/*
|
/*
|
||||||
* Exchange the modex information once again.
|
* Exchange the modex information once again.
|
||||||
* BTLs will have republished their modex information.
|
* BTLs will have republished their modex information.
|
||||||
|
@@ -669,7 +669,7 @@ int mca_pml_csum_ft_event( int state )
|
|||||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2);
|
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2);
|
||||||
}
|
}
|
||||||
|
|
||||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||||
/*
|
/*
|
||||||
* Get a list of processes
|
* Get a list of processes
|
||||||
*/
|
*/
|
||||||
@@ -769,7 +769,7 @@ int mca_pml_csum_ft_event( int state )
|
|||||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3);
|
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3);
|
||||||
}
|
}
|
||||||
|
|
||||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||||
/*
|
/*
|
||||||
* Exchange the modex information once again.
|
* Exchange the modex information once again.
|
||||||
* BTLs will have republished their modex information.
|
* BTLs will have republished their modex information.
|
||||||
|
@@ -638,7 +638,7 @@ int mca_pml_ob1_ft_event( int state )
|
|||||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2);
|
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2);
|
||||||
}
|
}
|
||||||
|
|
||||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||||
/*
|
/*
|
||||||
* Get a list of processes
|
* Get a list of processes
|
||||||
*/
|
*/
|
||||||
@@ -738,7 +738,7 @@ int mca_pml_ob1_ft_event( int state )
|
|||||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3);
|
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3);
|
||||||
}
|
}
|
||||||
|
|
||||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||||
/*
|
/*
|
||||||
* Exchange the modex information once again.
|
* Exchange the modex information once again.
|
||||||
* BTLs will have republished their modex information.
|
* BTLs will have republished their modex information.
|
||||||
|
38
ompi/mpiext/cr/Makefile.am
Normal file
@@ -0,0 +1,38 @@
|
|||||||
|
#
|
||||||
|
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
# University Research and Technology
|
||||||
|
# Corporation. All rights reserved.
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
|
||||||
|
headers = \
|
||||||
|
mpiext_cr_c.h
|
||||||
|
|
||||||
|
sources = \
|
||||||
|
c/checkpoint.c \
|
||||||
|
c/restart.c \
|
||||||
|
c/migrate.c \
|
||||||
|
c/inc_register_callback.c \
|
||||||
|
c/quiesce_start.c \
|
||||||
|
c/quiesce_end.c \
|
||||||
|
c/quiesce_checkpoint.c \
|
||||||
|
c/self_register_checkpoint.c \
|
||||||
|
c/self_register_restart.c \
|
||||||
|
c/self_register_continue.c
|
||||||
|
|
||||||
|
lib = libext_mpiext_cr.la
|
||||||
|
lib_sources = $(sources)
|
||||||
|
|
||||||
|
extcomponentdir = $(pkglibdir)
|
||||||
|
|
||||||
|
noinst_LTLIBRARIES = $(lib)
|
||||||
|
libext_mpiext_cr_la_SOURCES = $(lib_sources)
|
||||||
|
libext_mpiext_cr_la_LDFLAGS = -module -avoid-version
|
||||||
|
|
||||||
|
ompidir = $(includedir)/openmpi/ompi/mpiext/cr
|
||||||
|
ompi_HEADERS = \
|
||||||
|
$(headers)
|
88
ompi/mpiext/cr/c/checkpoint.c
Normal file
@@ -0,0 +1,88 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
* University Research and Technology
|
||||||
|
* Corporation. All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
#include "ompi_config.h"
|
||||||
|
#include <stdio.h>
|
||||||
|
|
||||||
|
#include "ompi/mpi/c/bindings.h"
|
||||||
|
#include "ompi/info/info.h"
|
||||||
|
#include "ompi/runtime/params.h"
|
||||||
|
#include "ompi/communicator/communicator.h"
|
||||||
|
#include "orte/mca/snapc/snapc.h"
|
||||||
|
|
||||||
|
#include "ompi/mpiext/cr/mpiext_cr_c.h"
|
||||||
|
|
||||||
|
static const char FUNC_NAME[] = "OMPI_CR_Checkpoint";
|
||||||
|
#define HANDLE_SIZE_MAX 256
|
||||||
|
|
||||||
|
int OMPI_CR_Checkpoint(char **handle, int *seq, MPI_Info *info)
|
||||||
|
{
|
||||||
|
int ret = MPI_SUCCESS;
|
||||||
|
MPI_Comm comm = MPI_COMM_WORLD;
|
||||||
|
orte_snapc_base_request_op_t *datum = NULL;
|
||||||
|
int state = 0;
|
||||||
|
int my_rank;
|
||||||
|
|
||||||
|
/* argument checking */
|
||||||
|
if (MPI_PARAM_CHECK) {
|
||||||
|
OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Setup the data structure for the operation
|
||||||
|
*/
|
||||||
|
datum = OBJ_NEW(orte_snapc_base_request_op_t);
|
||||||
|
datum->event = ORTE_SNAPC_OP_CHECKPOINT;
|
||||||
|
datum->is_active = true;
|
||||||
|
|
||||||
|
MPI_Comm_rank(comm, &my_rank);
|
||||||
|
if( 0 == my_rank ) {
|
||||||
|
datum->leader = ORTE_PROC_MY_NAME->vpid;
|
||||||
|
} else {
|
||||||
|
datum->leader = -1; /* Unknown from non-root ranks */
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* All processes must make this call before it can start
|
||||||
|
*/
|
||||||
|
MPI_Barrier(comm);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Leader sends the request
|
||||||
|
*/
|
||||||
|
OPAL_CR_ENTER_LIBRARY();
|
||||||
|
ret = orte_snapc.request_op(datum);
|
||||||
|
if( OMPI_SUCCESS != ret ) {
|
||||||
|
OBJ_RELEASE(datum);
|
||||||
|
OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
|
||||||
|
FUNC_NAME);
|
||||||
|
}
|
||||||
|
OPAL_CR_EXIT_LIBRARY();
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Leader then sends out the commit message
|
||||||
|
*/
|
||||||
|
if( datum->leader == (int)ORTE_PROC_MY_NAME->vpid ) {
|
||||||
|
*handle = strdup(datum->global_handle);
|
||||||
|
*seq = datum->seq_num;
|
||||||
|
state = 0;
|
||||||
|
} else {
|
||||||
|
*handle = (char*)malloc(sizeof(char)*HANDLE_SIZE_MAX);
|
||||||
|
}
|
||||||
|
|
||||||
|
MPI_Bcast(&state, 1, MPI_INT, 0, comm);
|
||||||
|
MPI_Bcast(seq, 1, MPI_INT, 0, comm);
|
||||||
|
MPI_Bcast(*handle, HANDLE_SIZE_MAX, MPI_CHAR, 0, comm);
|
||||||
|
|
||||||
|
datum->is_active = false;
|
||||||
|
OBJ_RELEASE(datum);
|
||||||
|
|
||||||
|
return ret;
|
||||||
|
}
|
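In `OMPI_CR_Checkpoint` above, the leader's `*handle` comes from `strdup(datum->global_handle)`, yet the later `MPI_Bcast` sends `HANDLE_SIZE_MAX` bytes from it, so the broadcast may read past the end of a shorter string. A minimal sketch of one way to avoid that, assuming a hypothetical helper `make_bcast_handle` (not part of this commit or of Open MPI) that copies the handle into a zero-filled buffer of the full broadcast size:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HANDLE_SIZE_MAX 256  /* same value as in checkpoint.c */

/*
 * Hypothetical helper, not part of Open MPI: return a zero-filled buffer
 * of exactly HANDLE_SIZE_MAX bytes holding a (possibly truncated) copy of
 * global_handle, so that a fixed-size broadcast of HANDLE_SIZE_MAX chars
 * never reads past the allocation.
 * Returns NULL on allocation failure. The caller frees the result.
 */
static char *make_bcast_handle(const char *global_handle)
{
    char *buf;

    assert(NULL != global_handle);

    buf = calloc(HANDLE_SIZE_MAX, sizeof(char));
    if (NULL == buf) {
        return NULL;
    }
    /* Copy at most HANDLE_SIZE_MAX - 1 bytes; calloc keeps the tail NUL. */
    strncpy(buf, global_handle, HANDLE_SIZE_MAX - 1);
    return buf;
}
```

Under this assumption the leader would assign `*handle = make_bcast_handle(datum->global_handle)` in place of the `strdup` call, and the non-leader branch would keep its `HANDLE_SIZE_MAX`-byte `malloc` unchanged.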
39
ompi/mpiext/cr/c/inc_register_callback.c
Normal file
@@ -0,0 +1,39 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
* University Research and Technology
|
||||||
|
* Corporation. All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
#include "ompi_config.h"
|
||||||
|
#include <stdio.h>
|
||||||
|
|
||||||
|
#include "ompi/mpi/c/bindings.h"
|
||||||
|
#include "opal/runtime/opal_cr.h"
|
||||||
|
#include "ompi/mpiext/cr/mpiext_cr_c.h"
|
||||||
|
|
||||||
|
#include "ompi/runtime/params.h"
|
||||||
|
#include "ompi/communicator/communicator.h"
|
||||||
|
#include "ompi/errhandler/errhandler.h"
|
||||||
|
|
||||||
|
static const char FUNC_NAME[] = "OMPI_CR_INC_register_callback";
|
||||||
|
|
||||||
|
int OMPI_CR_INC_register_callback(OMPI_CR_INC_callback_event_t event,
|
||||||
|
OMPI_CR_INC_callback_function function,
|
||||||
|
OMPI_CR_INC_callback_function *prev_function)
|
||||||
|
{
|
||||||
|
int rc;
|
||||||
|
|
||||||
|
if ( MPI_PARAM_CHECK ) {
|
||||||
|
OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
|
||||||
|
}
|
||||||
|
|
||||||
|
OPAL_CR_ENTER_LIBRARY();
|
||||||
|
|
||||||
|
rc = opal_cr_user_inc_register_callback(event, function, prev_function);
|
||||||
|
|
||||||
|
OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME);
|
||||||
|
}
|
120
ompi/mpiext/cr/c/migrate.c
Normal file
@@ -0,0 +1,120 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
* University Research and Technology
|
||||||
|
* Corporation. All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
#include "ompi_config.h"
|
||||||
|
#include <stdio.h>
|
||||||
|
|
||||||
|
#include "ompi/mpi/c/bindings.h"
|
||||||
|
#include "ompi/info/info.h"
|
||||||
|
#include "ompi/runtime/params.h"
|
||||||
|
#include "ompi/communicator/communicator.h"
|
||||||
|
#include "orte/mca/snapc/snapc.h"
|
||||||
|
|
||||||
|
#include "ompi/mpiext/cr/mpiext_cr_c.h"
|
||||||
|
|
||||||
|
static const char FUNC_NAME[] = "OMPI_CR_Migrate";
|
||||||
|
|
||||||
|
int OMPI_CR_Migrate(MPI_Comm comm, char *hostname, int rank, MPI_Info *info)
|
||||||
|
{
|
||||||
|
int ret = MPI_SUCCESS;
|
||||||
|
orte_snapc_base_request_op_t *datum = NULL;
|
||||||
|
int my_rank, my_size, i;
|
||||||
|
char loc_hostname[MPI_MAX_PROCESSOR_NAME];
|
||||||
|
int my_vpid;
|
||||||
|
int info_flag;
|
||||||
|
char info_value[6];
|
||||||
|
int my_off_node = (int)false;
|
||||||
|
|
||||||
|
/* argument checking */
|
||||||
|
if (MPI_PARAM_CHECK) {
|
||||||
|
OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Setup the data structure for the operation
|
||||||
|
*/
|
||||||
|
datum = OBJ_NEW(orte_snapc_base_request_op_t);
|
||||||
|
datum->event = ORTE_SNAPC_OP_MIGRATE;
|
||||||
|
datum->is_active = true;
|
||||||
|
|
||||||
|
MPI_Comm_rank(comm, &my_rank);
|
||||||
|
MPI_Comm_size(comm, &my_size);
|
||||||
|
if( 0 == my_rank ) {
|
||||||
|
datum->leader = ORTE_PROC_MY_NAME->vpid;
|
||||||
|
} else {
|
||||||
|
datum->leader = -1; /* Unknown from non-root ranks */
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Gather all preferences to the root
|
||||||
|
*/
|
||||||
|
if( NULL == hostname ) {
|
||||||
|
loc_hostname[0] = '\0';
|
||||||
|
} else {
|
||||||
|
strncpy(loc_hostname, hostname, strlen(hostname));
|
||||||
|
loc_hostname[strlen(hostname)] = '\0';
|
||||||
|
}
|
||||||
|
my_vpid = (int) ORTE_PROC_MY_NAME->vpid;
|
||||||
|
|
||||||
|
if( 0 == my_rank ) {
|
||||||
|
datum->mig_num = my_size;
|
||||||
|
datum->mig_vpids = malloc(sizeof(int) * my_size);
|
||||||
|
datum->mig_host_pref = malloc(sizeof(char) * my_size * MPI_MAX_PROCESSOR_NAME);
|
||||||
|
datum->mig_vpid_pref = malloc(sizeof(int) * my_size);
|
||||||
|
datum->mig_off_node = malloc(sizeof(int) * my_size);
|
||||||
|
|
||||||
|
for( i = 0; i < my_size; ++i ) {
|
||||||
|
(datum->mig_vpids)[i] = 0;
|
||||||
|
(datum->mig_host_pref)[i][0] = '\0';
|
||||||
|
(datum->mig_vpid_pref)[i] = 0;
|
||||||
|
(datum->mig_off_node)[i] = (int)false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
my_off_node = (int)false;
|
||||||
|
if( NULL != info ) {
|
||||||
|
MPI_Info_get(*info, "CR_OFF_NODE", 5, info_value, &info_flag);
|
||||||
|
if( info_flag ) {
|
||||||
|
if( 0 == strncmp(info_value, "true", strlen("true")) ) {
|
||||||
|
my_off_node = (int)true;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
MPI_Gather(&my_vpid, 1, MPI_INT,
|
||||||
|
(datum->mig_vpids), 1, MPI_INT, 0, comm);
|
||||||
|
MPI_Gather(loc_hostname, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
|
||||||
|
(datum->mig_host_pref), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, comm);
|
||||||
|
MPI_Gather(&my_vpid, 1, MPI_INT,
|
||||||
|
(datum->mig_vpid_pref), 1, MPI_INT, 0, comm);
|
||||||
|
MPI_Gather(&my_off_node, 1, MPI_INT,
|
||||||
|
(datum->mig_off_node), 1, MPI_INT, 0, comm);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Leader sends the request
|
||||||
|
*/
|
||||||
|
OPAL_CR_ENTER_LIBRARY();
|
||||||
|
ret = orte_snapc.request_op(datum);
|
||||||
|
if( OMPI_SUCCESS != ret ) {
|
||||||
|
OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
|
||||||
|
FUNC_NAME);
|
||||||
|
}
|
||||||
|
OPAL_CR_EXIT_LIBRARY();
|
||||||
|
|
||||||
|
datum->is_active = false;
|
||||||
|
OBJ_RELEASE(datum);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* All processes must sync before leaving
|
||||||
|
*/
|
||||||
|
MPI_Barrier(comm);
|
||||||
|
|
||||||
|
return ret;
|
||||||
|
}
|
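In `OMPI_CR_Migrate` above, the hostname preference is copied with `strncpy(loc_hostname, hostname, strlen(hostname))`, which bounds the copy by the source length and not by the size of `loc_hostname`. A minimal sketch of a bounded copy, assuming a hypothetical helper `copy_hostname_pref` and a stand-in constant `HOSTNAME_MAX` in place of `MPI_MAX_PROCESSOR_NAME` (neither name appears in the commit):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for MPI_MAX_PROCESSOR_NAME so the sketch builds without mpi.h. */
#define HOSTNAME_MAX 256

/*
 * Hypothetical helper: copy an optional hostname preference into a
 * fixed-size buffer, truncating if needed and always NUL-terminating.
 * A NULL hostname yields the empty string, as in OMPI_CR_Migrate.
 * Returns the number of characters stored (excluding the NUL).
 */
static size_t copy_hostname_pref(char *dst, size_t dst_len, const char *hostname)
{
    size_t n;

    assert(NULL != dst && dst_len > 0);

    if (NULL == hostname) {
        dst[0] = '\0';
        return 0;
    }

    n = strlen(hostname);
    if (n >= dst_len) {
        n = dst_len - 1;  /* leave room for the terminator */
    }
    memcpy(dst, hostname, n);
    dst[n] = '\0';
    return n;
}
```

With such a helper, the two-line copy in the function would become a single call like `copy_hostname_pref(loc_hostname, sizeof(loc_hostname), hostname)`. Whether silent truncation or an error return is the better policy for an over-long hostname is a design choice this sketch does not settle.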
69
ompi/mpiext/cr/c/quiesce_checkpoint.c
Normal file
@@ -0,0 +1,69 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
* University Research and Technology
|
||||||
|
* Corporation. All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
#include "ompi_config.h"
|
||||||
|
#include <stdio.h>
|
||||||
|
|
||||||
|
#include "ompi/mpi/c/bindings.h"
|
||||||
|
#include "ompi/info/info.h"
|
||||||
|
#include "ompi/runtime/params.h"
|
||||||
|
#include "ompi/communicator/communicator.h"
|
||||||
|
#include "orte/mca/snapc/snapc.h"
|
||||||
|
|
||||||
|
#include "ompi/mpiext/cr/mpiext_cr_c.h"
|
||||||
|
|
||||||
|
static const char FUNC_NAME[] = "OMPI_CR_Quiesce_checkpoint";
|
||||||
|
|
||||||
|
int OMPI_CR_Quiesce_checkpoint(MPI_Comm commP, char **handle, int *seq, MPI_Info *info)
|
||||||
|
{
|
||||||
|
int ret = MPI_SUCCESS;
|
||||||
|
MPI_Comm comm = MPI_COMM_WORLD; /* Currently ignore provided comm */
|
||||||
|
orte_snapc_base_request_op_t *datum = NULL;
|
||||||
|
int my_rank;
|
||||||
|
|
||||||
|
/* argument checking */
|
||||||
|
if (MPI_PARAM_CHECK) {
|
||||||
|
OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Setup the data structure for the operation
|
||||||
|
*/
|
||||||
|
datum = OBJ_NEW(orte_snapc_base_request_op_t);
|
||||||
|
datum->event = ORTE_SNAPC_OP_QUIESCE_CHECKPOINT;
|
||||||
|
datum->is_active = true;
|
||||||
|
|
||||||
|
MPI_Comm_rank(comm, &my_rank);
|
||||||
|
if( 0 == my_rank ) {
|
||||||
|
datum->leader = ORTE_PROC_MY_NAME->vpid;
|
||||||
|
} else {
|
||||||
|
datum->leader = -1; /* Unknown from non-root ranks */
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Since we are quiescent, then this is a local operation
|
||||||
|
*/
|
||||||
|
OPAL_CR_ENTER_LIBRARY();
|
||||||
|
ret = orte_snapc.request_op(datum);
|
||||||
|
/*ret = ompi_crcp_base_quiesce_start(info);*/
|
||||||
|
if( OMPI_SUCCESS != ret ) {
|
||||||
|
OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_OTHER,
|
||||||
|
FUNC_NAME);
|
||||||
|
}
|
||||||
|
OPAL_CR_EXIT_LIBRARY();
|
||||||
|
|
||||||
|
*handle = strdup(datum->global_handle);
|
||||||
|
*seq = datum->seq_num;
|
||||||
|
|
||||||
|
datum->is_active = false;
|
||||||
|
OBJ_RELEASE(datum);
|
||||||
|
|
||||||
|
return ret;
|
||||||
|
}
|
74
ompi/mpiext/cr/c/quiesce_end.c
Normal file
@@ -0,0 +1,74 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
* University Research and Technology
|
||||||
|
* Corporation. All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
#include "ompi_config.h"
|
||||||
|
#include <stdio.h>
|
||||||
|
|
||||||
|
#include "ompi/mpi/c/bindings.h"
|
||||||
|
#include "ompi/info/info.h"
|
||||||
|
#include "ompi/runtime/params.h"
|
||||||
|
#include "ompi/communicator/communicator.h"
|
||||||
|
#include "orte/mca/snapc/snapc.h"
|
||||||
|
|
||||||
|
#include "ompi/mpiext/cr/mpiext_cr_c.h"
|
||||||
|
|
||||||
|
static const char FUNC_NAME[] = "OMPI_CR_Quiesce_end";
|
||||||
|
|
||||||
|
int OMPI_CR_Quiesce_end(MPI_Comm commP, MPI_Info *info)
|
||||||
|
{
|
||||||
|
int ret = MPI_SUCCESS;
|
||||||
|
MPI_Comm comm = MPI_COMM_WORLD; /* Currently ignore provided comm */
|
||||||
|
orte_snapc_base_request_op_t *datum = NULL;
|
||||||
|
int my_rank;
|
||||||
|
|
||||||
|
/* argument checking */
|
||||||
|
if (MPI_PARAM_CHECK) {
|
||||||
|
OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Setup the data structure for the operation
|
||||||
|
*/
|
||||||
|
datum = OBJ_NEW(orte_snapc_base_request_op_t);
|
||||||
|
datum->event = ORTE_SNAPC_OP_QUIESCE_END;
|
||||||
|
datum->is_active = true;
|
||||||
|
|
||||||
|
MPI_Comm_rank(comm, &my_rank);
|
||||||
|
if( 0 == my_rank ) {
|
||||||
|
datum->leader = ORTE_PROC_MY_NAME->vpid;
|
||||||
|
} else {
|
||||||
|
datum->leader = -1; /* Unknown from non-root ranks */
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Leader sends the request
|
||||||
|
*/
|
||||||
|
OPAL_CR_ENTER_LIBRARY();
|
||||||
|
ret = orte_snapc.request_op(datum);
|
||||||
|
/*ret = ompi_crcp_base_quiesce_end(info);*/
|
||||||
|
if( OMPI_SUCCESS != ret ) {
|
||||||
|
OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
|
||||||
|
FUNC_NAME);
|
||||||
|
}
|
||||||
|
OPAL_CR_EXIT_LIBRARY();
|
||||||
|
|
||||||
|
/*
|
||||||
|
* All processes must make this call before it can complete
|
||||||
|
*/
|
||||||
|
MPI_Barrier(comm);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* (Old) info logic
|
||||||
|
*/
|
||||||
|
/*cur_datum.epoch = -1;*/
|
||||||
|
|
||||||
|
return ret;
|
||||||
|
}
|
||||||
|
|
210
ompi/mpiext/cr/c/quiesce_start.c
Normal file
@@ -0,0 +1,210 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
* University Research and Technology
|
||||||
|
* Corporation. All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
#include "ompi_config.h"
|
||||||
|
#include <stdio.h>
|
||||||
|
|
||||||
|
#include "ompi/mpi/c/bindings.h"
|
||||||
|
#include "ompi/info/info.h"
|
||||||
|
#include "ompi/runtime/params.h"
|
||||||
|
#include "ompi/communicator/communicator.h"
|
||||||
|
#include "orte/mca/snapc/snapc.h"
|
||||||
|
|
||||||
|
#include "ompi/mpiext/cr/mpiext_cr_c.h"
|
||||||
|
|
||||||
|
static const char FUNC_NAME[] = "OMPI_CR_Quiesce_start";
|
||||||
|
|
||||||
|
int OMPI_CR_Quiesce_start(MPI_Comm commP, MPI_Info *info)
|
||||||
|
{
|
||||||
|
int ret = MPI_SUCCESS;
|
||||||
|
MPI_Comm comm = MPI_COMM_WORLD; /* Currently ignore provided comm */
|
||||||
|
orte_snapc_base_request_op_t *datum = NULL;
|
||||||
|
int my_rank;
|
||||||
|
|
||||||
|
/* argument checking */
|
||||||
|
if (MPI_PARAM_CHECK) {
|
||||||
|
OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Setup the data structure for the operation
|
||||||
|
*/
|
||||||
|
datum = OBJ_NEW(orte_snapc_base_request_op_t);
|
||||||
|
datum->event = ORTE_SNAPC_OP_QUIESCE_START;
|
||||||
|
datum->is_active = true;
|
||||||
|
|
||||||
|
MPI_Comm_rank(comm, &my_rank);
|
||||||
|
if( 0 == my_rank ) {
|
||||||
|
datum->leader = ORTE_PROC_MY_NAME->vpid;
|
||||||
|
} else {
|
||||||
|
datum->leader = -1; /* Unknown from non-root ranks */
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* All processes must make this call before it can start
|
||||||
|
*/
|
||||||
|
MPI_Barrier(comm);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Leader sends the request
|
||||||
|
*/
|
||||||
|
OPAL_CR_ENTER_LIBRARY();
|
||||||
|
ret = orte_snapc.request_op(datum);
|
||||||
|
/*ret = ompi_crcp_base_quiesce_start(info);*/
|
||||||
|
if( OMPI_SUCCESS != ret ) {
|
||||||
|
OBJ_RELEASE(datum);
|
||||||
|
OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
|
||||||
|
FUNC_NAME);
|
||||||
|
}
|
||||||
|
|
||||||
|
OPAL_CR_EXIT_LIBRARY();
|
||||||
|
|
||||||
|
datum->is_active = false;
|
||||||
|
OBJ_RELEASE(datum);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* (Old) info logic
|
||||||
|
*/
|
||||||
|
/*ompi_info_set((ompi_info_t*)*info, "target", cur_datum.target_dir);*/
|
||||||
|
|
||||||
|
return ret;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*****************
|
||||||
|
* Local Functions
|
||||||
|
******************/
|
||||||
|
#if 0
|
||||||
|
/* Info keys:
|
||||||
|
*
|
||||||
|
* - crs:
|
||||||
|
* none = (Default) No CRS Service
|
||||||
|
* default = Whatever CRS service MPI chooses
|
||||||
|
* blcr = BLCR
|
||||||
|
* self = app level callbacks
|
||||||
|
*
|
||||||
|
* - cmdline:
|
||||||
|
* Command line to restart the process with.
|
||||||
|
* If empty, the user must manually enter it
|
||||||
|
*
|
||||||
|
* - target:
|
||||||
|
* Absolute path to the target directory.
|
||||||
|
*
|
||||||
|
* - handle:
|
||||||
|
* first = Earliest checkpoint directory available
|
||||||
|
* last = Most recent checkpoint directory available
|
||||||
|
* [global:local] = handle provided by the MPI library
|
||||||
|
*
|
||||||
|
* - restarting:
|
||||||
|
* 0 = not restarting
|
||||||
|
* 1 = restarting
|
||||||
|
*
|
||||||
|
* - checkpointing:
|
||||||
|
* 0 = No need to prepare for checkpointing
|
||||||
|
* 1 = MPI should prepare for checkpointing
|
||||||
|
*
|
||||||
|
* - inflight:
|
||||||
|
* default = message
|
||||||
|
* message = Drain inflight messages at the message level
|
||||||
|
* network = Drain inflight messages at the network level (if possible)
|
||||||
|
*
|
||||||
|
* - user_space_mem:
|
||||||
|
* 0 = Memory does not need to be managed
|
||||||
|
* 1 = Memory must be in user space (i.e., not on network card)
|
||||||
|
*
|
||||||
|
*/
|
||||||
|
static int extract_info_into_datum(ompi_info_t *info, orte_snapc_base_quiesce_t *datum)
|
||||||
|
{
|
||||||
|
int info_flag = false;
|
||||||
|
int max_crs_len = 32;
|
||||||
|
bool info_bool = false;
|
||||||
|
char *info_char = NULL;
|
||||||
|
|
||||||
|
info_char = (char *) malloc(sizeof(char) * (OPAL_PATH_MAX+1));
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Key: crs
|
||||||
|
*/
|
||||||
|
ompi_info_get(info, "crs", max_crs_len, info_char, &info_flag);
|
||||||
|
if( info_flag) {
|
||||||
|
datum->crs_name = strdup(info_char);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Key: cmdline
|
||||||
|
*/
|
||||||
|
ompi_info_get(info, "cmdline", OPAL_PATH_MAX, info_char, &info_flag);
|
||||||
|
if( info_flag) {
|
||||||
|
datum->cmdline = strdup(info_char);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Key: handle
|
||||||
|
*/
|
||||||
|
ompi_info_get(info, "handle", OPAL_PATH_MAX, info_char, &info_flag);
|
||||||
|
if( info_flag) {
|
||||||
|
datum->handle = strdup(info_char);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Key: target
|
||||||
|
*/
|
||||||
|
ompi_info_get(info, "target", OPAL_PATH_MAX, info_char, &info_flag);
|
||||||
|
if( info_flag) {
|
||||||
|
datum->target_dir = strdup(info_char);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Key: restarting
|
||||||
|
*/
|
||||||
|
ompi_info_get_bool(info, "restarting", &info_bool, &info_flag);
|
||||||
|
if( info_flag ) {
|
||||||
|
datum->restarting = info_bool;
|
||||||
|
} else {
|
||||||
|
datum->restarting = false;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Key: checkpointing
|
||||||
|
*/
|
||||||
|
ompi_info_get_bool(info, "checkpointing", &info_bool, &info_flag);
|
||||||
|
if( info_flag ) {
|
||||||
|
datum->checkpointing = info_bool;
|
||||||
|
} else {
|
||||||
|
datum->checkpointing = false;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Display all values
|
||||||
|
*/
|
||||||
|
OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle,
|
||||||
|
"crcp:bkmrk: %s extract_info: Info('crs' = '%s')",
|
||||||
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
||||||
|
(NULL == datum->crs_name ? "Default (none)" : datum->crs_name)));
|
||||||
|
OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle,
|
||||||
|
"crcp:bkmrk: %s extract_info: Info('cmdline' = '%s')",
|
||||||
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
||||||
|
(NULL == datum->cmdline ? "Default ()" : datum->cmdline)));
|
||||||
|
OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle,
|
||||||
|
"crcp:bkmrk: %s extract_info: Info('checkpointing' = '%c')",
|
||||||
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
||||||
|
(datum->checkpointing ? 'T' : 'F')));
|
||||||
|
OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle,
|
||||||
|
"crcp:bkmrk: %s extract_info: Info('restarting' = '%c')",
|
||||||
|
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
||||||
|
(datum->restarting ? 'T' : 'F')));
|
||||||
|
|
||||||
|
if( NULL != info_char ) {
|
||||||
|
free(info_char);
|
||||||
|
info_char = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
#endif
|
66
ompi/mpiext/cr/c/restart.c
Normal file
@@ -0,0 +1,66 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
* University Research and Technology
|
||||||
|
* Corporation. All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
#include "ompi_config.h"
|
||||||
|
#include <stdio.h>
|
||||||
|
|
||||||
|
#include "ompi/mpi/c/bindings.h"
|
||||||
|
#include "ompi/info/info.h"
|
||||||
|
#include "ompi/runtime/params.h"
|
||||||
|
#include "ompi/communicator/communicator.h"
|
||||||
|
#include "orte/mca/snapc/snapc.h"
|
||||||
|
|
||||||
|
#include "ompi/mpiext/cr/mpiext_cr_c.h"
|
||||||
|
|
||||||
|
static const char FUNC_NAME[] = "OMPI_CR_Restart";
|
||||||
|
|
||||||
|
int OMPI_CR_Restart(char *handle, int seq, MPI_Info *info)
|
||||||
|
{
|
||||||
|
int ret = MPI_SUCCESS;
|
||||||
|
MPI_Comm comm = MPI_COMM_WORLD;
|
||||||
|
orte_snapc_base_request_op_t *datum = NULL;
|
||||||
|
|
||||||
|
/* argument checking */
|
||||||
|
if (MPI_PARAM_CHECK) {
|
||||||
|
OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Setup the data structure for the operation
|
||||||
|
*/
|
||||||
|
datum = OBJ_NEW(orte_snapc_base_request_op_t);
|
||||||
|
datum->event = ORTE_SNAPC_OP_RESTART;
|
||||||
|
datum->is_active = true;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Restart is not collective, so the caller is the leader
|
||||||
|
*/
|
||||||
|
datum->leader = ORTE_PROC_MY_NAME->vpid;
|
||||||
|
datum->seq_num = seq;
|
||||||
|
datum->global_handle = strdup(handle);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Leader sends the request
|
||||||
|
*/
|
||||||
|
OPAL_CR_ENTER_LIBRARY();
|
||||||
|
ret = orte_snapc.request_op(datum);
|
||||||
|
if( OMPI_SUCCESS != ret ) {
|
||||||
|
OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
|
||||||
|
FUNC_NAME);
|
||||||
|
}
|
||||||
|
OPAL_CR_EXIT_LIBRARY();
|
||||||
|
|
||||||
|
datum->is_active = false;
|
||||||
|
OBJ_RELEASE(datum);
|
||||||
|
|
||||||
|
/********** If successful, should never reach this point (JJH) ******/
|
||||||
|
|
||||||
|
return ret;
|
||||||
|
}
|
39
ompi/mpiext/cr/c/self_register_checkpoint.c
Normal file
@@ -0,0 +1,39 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
* University Research and Technology
|
||||||
|
* Corporation. All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
#include "ompi_config.h"
|
||||||
|
#include <stdio.h>
|
||||||
|
|
||||||
|
#include "ompi/mpi/c/bindings.h"
|
||||||
|
#include "opal/runtime/opal_cr.h"
|
||||||
|
#include "ompi/mpiext/cr/mpiext_cr_c.h"
|
||||||
|
|
||||||
|
#include "ompi/runtime/params.h"
|
||||||
|
#include "ompi/communicator/communicator.h"
|
||||||
|
#include "ompi/errhandler/errhandler.h"
|
||||||
|
#include "opal/mca/crs/crs.h"
|
||||||
|
#include "opal/mca/crs/base/base.h"
|
||||||
|
|
||||||
|
static const char FUNC_NAME[] = "OMPI_CR_self_register_checkpoint_callback";
|
||||||
|
|
||||||
|
int OMPI_CR_self_register_checkpoint_callback(OMPI_CR_self_checkpoint_fn function)
|
||||||
|
{
|
||||||
|
int rc;
|
||||||
|
|
||||||
|
if ( MPI_PARAM_CHECK ) {
|
||||||
|
OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
|
||||||
|
}
|
||||||
|
|
||||||
|
OPAL_CR_ENTER_LIBRARY();
|
||||||
|
|
||||||
|
rc = opal_crs_base_self_register_checkpoint_callback(function);
|
||||||
|
|
||||||
|
OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME);
|
||||||
|
}
|
39
ompi/mpiext/cr/c/self_register_continue.c
Normal file
@ -0,0 +1,39 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "opal/runtime/opal_cr.h"
#include "ompi/mpiext/cr/mpiext_cr_c.h"

#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "ompi/errhandler/errhandler.h"
#include "opal/mca/crs/crs.h"
#include "opal/mca/crs/base/base.h"

static const char FUNC_NAME[] = "OMPI_CR_self_register_continue_callback";

int OMPI_CR_self_register_continue_callback(OMPI_CR_self_continue_fn function)
{
    int rc;

    if ( MPI_PARAM_CHECK ) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    OPAL_CR_ENTER_LIBRARY();

    rc = opal_crs_base_self_register_continue_callback(function);

    OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME);
}
39
ompi/mpiext/cr/c/self_register_restart.c
Normal file
@ -0,0 +1,39 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "opal/runtime/opal_cr.h"
#include "ompi/mpiext/cr/mpiext_cr_c.h"

#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "ompi/errhandler/errhandler.h"
#include "opal/mca/crs/crs.h"
#include "opal/mca/crs/base/base.h"

static const char FUNC_NAME[] = "OMPI_CR_self_register_restart_callback";

int OMPI_CR_self_register_restart_callback(OMPI_CR_self_restart_fn function)
{
    int rc;

    if ( MPI_PARAM_CHECK ) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    OPAL_CR_ENTER_LIBRARY();

    rc = opal_crs_base_self_register_restart_callback(function);

    OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME);
}
19
ompi/mpiext/cr/configure.m4
Normal file
@ -0,0 +1,19 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2010 The Trustees of Indiana University.
#                         All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

# EXT_ompi_cr_CONFIG([action-if-found], [action-if-not-found])
# -----------------------------------------------------------
AC_DEFUN([EXT_mpiext_cr_CONFIG],[
    # If we don't want FT, don't compile this component
    AS_IF([test "$opal_want_ft_cr" = "1"],
          [$1],
          [$2])
])dnl
12
ompi/mpiext/cr/configure.params
Normal file
@ -0,0 +1,12 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2010 The Trustees of Indiana University.
#                         All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

PARAM_CONFIG_FILES="Makefile"
82
ompi/mpiext/cr/mpiext_cr_c.h
Normal file
@ -0,0 +1,82 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 *
 */
#include "opal/runtime/opal_cr.h"

/********************************
 * C/R Interfaces
 ********************************/
/*
 * Request a checkpoint
 */
OMPI_DECLSPEC int OMPI_CR_Checkpoint(char **handle, int *seq, MPI_Info *info);

/*
 * Request a restart
 */
OMPI_DECLSPEC int OMPI_CR_Restart(char *handle, int seq, MPI_Info *info);


/********************************
 * Migration Interface
 ********************************/
/*
 * Request a migration
 */
OMPI_DECLSPEC int OMPI_CR_Migrate(MPI_Comm comm, char *hostname, int rank, MPI_Info *info);


/********************************
 * INC Interfaces
 ********************************/
typedef opal_cr_user_inc_callback_event_t OMPI_CR_INC_callback_event_t;

typedef opal_cr_user_inc_callback_state_t OMPI_CR_INC_callback_state_t;

typedef int (*OMPI_CR_INC_callback_function)(OMPI_CR_INC_callback_event_t event,
                                             OMPI_CR_INC_callback_state_t state);

OMPI_DECLSPEC int OMPI_CR_INC_register_callback(OMPI_CR_INC_callback_event_t event,
                                                OMPI_CR_INC_callback_function function,
                                                OMPI_CR_INC_callback_function *prev_function);


/********************************
 * SELF CRS Application Interfaces
 ********************************/
typedef int (*OMPI_CR_self_checkpoint_fn)(char **restart_cmd);
typedef int (*OMPI_CR_self_restart_fn)(void);
typedef int (*OMPI_CR_self_continue_fn)(void);

OMPI_DECLSPEC int OMPI_CR_self_register_checkpoint_callback(OMPI_CR_self_checkpoint_fn function);
OMPI_DECLSPEC int OMPI_CR_self_register_restart_callback(OMPI_CR_self_restart_fn function);
OMPI_DECLSPEC int OMPI_CR_self_register_continue_callback(OMPI_CR_self_continue_fn function);


/********************************
 * Quiescence Interfaces
 ********************************/
/*
 * Start the Quiescent region.
 * Note: 'comm' required to be MPI_COMM_WORLD
 */
OMPI_DECLSPEC int OMPI_CR_Quiesce_start(MPI_Comm comm, MPI_Info *info);

/*
 * Request a checkpoint during a quiescent region
 * Note: 'comm' required to be MPI_COMM_WORLD
 */
OMPI_DECLSPEC int OMPI_CR_Quiesce_checkpoint(MPI_Comm comm, char **handle, int *seq, MPI_Info *info);

/*
 * End the Quiescent Region
 * Note: 'comm' required to be MPI_COMM_WORLD
 */
OMPI_DECLSPEC int OMPI_CR_Quiesce_end(MPI_Comm comm, MPI_Info *info);
|
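The `OMPI_CR_INC_register_callback` prototype in `mpiext_cr_c.h` takes a new callback and a `prev_function` out-parameter. Assuming, as the parameter name suggests, that the previously registered callback is handed back through it, a caller could chain callbacks as in the standalone sketch below. Nothing here calls Open MPI: every `demo_*` name, the two enums, and the slot table are hypothetical stand-ins used only to illustrate the pattern.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the INC event and state types (not the Open MPI enums). */
typedef enum { DEMO_EVENT_PRE_CKPT = 0, DEMO_EVENT_POST_CKPT, DEMO_EVENT_MAX } demo_event_t;
typedef enum { DEMO_STATE_PREPARE = 0, DEMO_STATE_ERROR } demo_state_t;

typedef int (*demo_callback_fn)(demo_event_t event, demo_state_t state);

/* One slot per event; NULL means no callback is registered. */
static demo_callback_fn demo_slots[DEMO_EVENT_MAX];

/* Install 'function' for 'event' and hand back whatever was installed before. */
int demo_register_callback(demo_event_t event, demo_callback_fn function,
                           demo_callback_fn *prev_function)
{
    if ((int)event < 0 || (int)event >= (int)DEMO_EVENT_MAX) {
        return -1;
    }
    if (NULL != prev_function) {
        *prev_function = demo_slots[event];
    }
    demo_slots[event] = function;
    return 0;
}

/* Fire the callback registered for 'event', if any. */
int demo_fire(demo_event_t event, demo_state_t state)
{
    return (NULL != demo_slots[event]) ? demo_slots[event](event, state) : 0;
}

static int demo_first_calls = 0;
static int demo_second_calls = 0;
static demo_callback_fn demo_saved_prev = NULL;

int demo_first(demo_event_t event, demo_state_t state)
{
    (void)event;
    (void)state;
    demo_first_calls++;
    return 0;
}

/* Does its own work, then chains to the callback it displaced. */
int demo_second(demo_event_t event, demo_state_t state)
{
    demo_second_calls++;
    return (NULL != demo_saved_prev) ? demo_saved_prev(event, state) : 0;
}
```

Registering `demo_first` and then `demo_second` (saving the displaced pointer in `demo_saved_prev`) means a single `demo_fire` runs both. Returning the old pointer rather than keeping a list leaves the chaining decision to the caller.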
@ -1,6 +1,6 @@
|
|||||||
/* -*- Mode: C; c-basic-offset:4 ; -*- */
|
/* -*- Mode: C; c-basic-offset:4 ; -*- */
|
||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
* University Research and Technology
|
* University Research and Technology
|
||||||
* Corporation. All rights reserved.
|
* Corporation. All rights reserved.
|
||||||
* Copyright (c) 2004-2007 The University of Tennessee and The University
|
* Copyright (c) 2004-2007 The University of Tennessee and The University
|
||||||
@ -43,6 +43,7 @@
|
|||||||
#include "opal/util/output.h"
|
#include "opal/util/output.h"
|
||||||
#include "opal/mca/crs/crs.h"
|
#include "opal/mca/crs/crs.h"
|
||||||
#include "opal/mca/crs/base/base.h"
|
#include "opal/mca/crs/base/base.h"
|
||||||
|
#include "opal/mca/installdirs/installdirs.h"
|
||||||
#include "opal/runtime/opal_cr.h"
|
#include "opal/runtime/opal_cr.h"
|
||||||
|
|
||||||
#include "orte/mca/snapc/snapc.h"
|
#include "orte/mca/snapc/snapc.h"
|
||||||
@ -56,6 +57,18 @@
|
|||||||
#include "ompi/mca/crcp/base/base.h"
|
#include "ompi/mca/crcp/base/base.h"
|
||||||
#include "ompi/communicator/communicator.h"
|
#include "ompi/communicator/communicator.h"
|
||||||
#include "ompi/runtime/ompi_cr.h"
|
#include "ompi/runtime/ompi_cr.h"
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
#include "orte/runtime/orte_globals.h"
|
||||||
|
#include "ompi/debuggers/debuggers.h"
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
OMPI_DECLSPEC int MPIR_checkpointable = 0;
|
||||||
|
OMPI_DECLSPEC char * MPIR_controller_hostname = NULL;
|
||||||
|
OMPI_DECLSPEC char * MPIR_checkpoint_command = NULL;
|
||||||
|
OMPI_DECLSPEC char * MPIR_restart_command = NULL;
|
||||||
|
OMPI_DECLSPEC char * MPIR_checkpoint_listing_command = NULL;
|
||||||
|
#endif
|
||||||
|
|
||||||
/*************
|
/*************
|
||||||
* Local functions
|
* Local functions
|
||||||
@ -68,8 +81,6 @@ static int ompi_cr_coord_post_ckpt(void);
|
|||||||
static int ompi_cr_coord_post_restart(void);
|
static int ompi_cr_coord_post_restart(void);
|
||||||
static int ompi_cr_coord_post_continue(void);
|
static int ompi_cr_coord_post_continue(void);
|
||||||
|
|
||||||
bool ompi_cr_continue_like_restart = false;
|
|
||||||
|
|
||||||
/*************
|
/*************
|
||||||
* Local vars
|
* Local vars
|
||||||
*************/
|
*************/
|
||||||
@ -157,15 +168,59 @@ int ompi_cr_init(void)
|
|||||||
ompi_cr_output = opal_cr_output;
|
ompi_cr_output = opal_cr_output;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Typically this is not needed. Individual BTLs will set this as needed */
|
|
||||||
ompi_cr_continue_like_restart = false;
|
|
||||||
|
|
||||||
opal_output_verbose(10, ompi_cr_output,
|
opal_output_verbose(10, ompi_cr_output,
|
||||||
"ompi_cr: init: ompi_cr_init()");
|
"ompi_cr: init: ompi_cr_init()");
|
||||||
|
|
||||||
/* Register the OMPI interlevel coordination callback */
|
/* Register the OMPI interlevel coordination callback */
|
||||||
opal_cr_reg_coord_callback(ompi_cr_coord, &prev_coord_callback);
|
opal_cr_reg_coord_callback(ompi_cr_coord, &prev_coord_callback);
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
/* Check for C/R enabled debugging */
|
||||||
|
if( MPIR_debug_with_checkpoint ) {
|
||||||
|
char *uri = NULL;
|
||||||
|
char *sep = NULL;
|
||||||
|
char *hostname = NULL;
|
||||||
|
|
||||||
|
/* Mark as debuggable with C/R */
|
||||||
|
MPIR_checkpointable = 1;
|
||||||
|
|
||||||
|
/* Set the checkpoint and restart commands */
|
||||||
|
/* Add the full path to the binary */
|
||||||
|
asprintf(&MPIR_checkpoint_command,
|
||||||
|
"%s/ompi-checkpoint --crdebug --hnp-jobid %u",
|
||||||
|
opal_install_dirs.bindir,
|
||||||
|
ORTE_PROC_MY_HNP->jobid);
|
||||||
|
asprintf(&MPIR_restart_command,
|
||||||
|
"%s/ompi-restart --crdebug ",
|
||||||
|
opal_install_dirs.bindir);
|
||||||
|
asprintf(&MPIR_checkpoint_listing_command,
|
||||||
|
"%s/ompi-checkpoint -l --crdebug ",
|
||||||
|
opal_install_dirs.bindir);
|
||||||
|
|
||||||
|
/* Set contact information for HNP */
|
||||||
|
uri = strdup(orte_process_info.my_hnp_uri);
|
||||||
|
hostname = strchr(uri, ';') + 1;
|
||||||
|
sep = strchr(hostname, ';');
|
||||||
|
if (sep) {
|
||||||
|
*sep = 0;
|
||||||
|
}
|
||||||
|
if (strncmp(hostname, "tcp://", 6) == 0) {
|
||||||
|
hostname += 6;
|
||||||
|
sep = strchr(hostname, ':');
|
||||||
|
*sep = 0;
|
||||||
|
MPIR_controller_hostname = strdup(hostname);
|
||||||
|
} else {
|
||||||
|
MPIR_controller_hostname = strdup("localhost");
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Cleanup */
|
||||||
|
if( NULL != uri ) {
|
||||||
|
free(uri);
|
||||||
|
uri = NULL;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
return OMPI_SUCCESS;
|
return OMPI_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
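The `ompi_cr_init()` hunk above derives `MPIR_controller_hostname` from the HNP URI by skipping past the first `;`, stripping a `tcp://` prefix, and cutting at the port separator. It dereferences the `strchr()` results for the first `;` and for `:` without checking them. The sketch below isolates the same parsing steps in a standalone function with those checks added. The name `demo_hostname_from_uri` and the sample URI layout in the usage note are assumptions for illustration, not taken from ORTE.

```c
#include <stdlib.h>
#include <string.h>

/* Copy the first n bytes of s into a fresh NUL-terminated buffer. */
static char *demo_copy_n(const char *s, size_t n)
{
    char *out = malloc(n + 1);
    if (NULL == out) {
        return NULL;
    }
    memcpy(out, s, n);
    out[n] = '\0';
    return out;
}

/*
 * Hypothetical helper: extract the host part of the first contact entry
 * that follows the first ';' in the URI. Falls back to "localhost" when
 * there is no ';' or the entry is not a tcp:// address.
 * The caller owns (and must free) the returned string.
 */
char *demo_hostname_from_uri(const char *hnp_uri)
{
    const char *host;

    if (NULL == hnp_uri) {
        return NULL;
    }

    host = strchr(hnp_uri, ';');
    if (NULL == host) {
        return demo_copy_n("localhost", 9);
    }
    host++;  /* step past the ';' */

    if (0 != strncmp(host, "tcp://", 6)) {
        return demo_copy_n("localhost", 9);
    }
    host += 6;

    /* Stop at the port separator or at the next entry, whichever comes first. */
    return demo_copy_n(host, strcspn(host, ":;"));
}
```

For an input shaped like `"123.0;tcp://192.168.1.5:4000;tcp://10.0.0.1:4000"`, the function returns a copy of `192.168.1.5`. Working on the `const` input with `strcspn()` avoids the temporary `strdup()` of the URI that the original code needs for in-place truncation.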
@ -196,9 +251,6 @@ int ompi_cr_coord(int state)
|
|||||||
* take action given the state.
|
* take action given the state.
|
||||||
*/
|
*/
|
||||||
if(OPAL_CRS_CHECKPOINT == state) {
|
if(OPAL_CRS_CHECKPOINT == state) {
|
||||||
/* Default: use the fast way */
|
|
||||||
ompi_cr_continue_like_restart = false;
|
|
||||||
|
|
||||||
/* Do Checkpoint Phase work */
|
/* Do Checkpoint Phase work */
|
||||||
ret = ompi_cr_coord_pre_ckpt();
|
ret = ompi_cr_coord_pre_ckpt();
|
||||||
if( ret == OMPI_EXISTS) {
|
if( ret == OMPI_EXISTS) {
|
||||||
@ -245,10 +297,30 @@ int ompi_cr_coord(int state)
|
|||||||
else if (OPAL_CRS_CONTINUE == state ) {
|
else if (OPAL_CRS_CONTINUE == state ) {
|
||||||
/* Do Continue Phase work */
|
/* Do Continue Phase work */
|
||||||
ompi_cr_coord_post_continue();
|
ompi_cr_coord_post_continue();
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
/*
|
||||||
|
* If C/R enabled debugging,
|
||||||
|
* wait here for debugger to attach
|
||||||
|
*/
|
||||||
|
if( MPIR_debug_with_checkpoint ) {
|
||||||
|
MPIR_checkpoint_debugger_breakpoint();
|
||||||
|
}
|
||||||
|
#endif
|
||||||
}
|
}
|
||||||
else if (OPAL_CRS_RESTART == state ) {
|
else if (OPAL_CRS_RESTART == state ) {
|
||||||
/* Do Restart Phase work */
|
/* Do Restart Phase work */
|
||||||
ompi_cr_coord_post_restart();
|
ompi_cr_coord_post_restart();
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
/*
|
||||||
|
* If C/R enabled debugging,
|
||||||
|
* wait here for debugger to attach
|
||||||
|
*/
|
||||||
|
if( MPIR_debug_with_checkpoint ) {
|
||||||
|
MPIR_checkpoint_debugger_breakpoint();
|
||||||
|
}
|
||||||
|
#endif
|
||||||
}
|
}
|
||||||
else if (OPAL_CRS_TERM == state ) {
|
else if (OPAL_CRS_TERM == state ) {
|
||||||
/* Do Continue Phase work in prep to terminate the application */
|
/* Do Continue Phase work in prep to terminate the application */
|
||||||
@ -330,7 +402,7 @@ static int ompi_cr_coord_pre_continue(void) {
|
|||||||
opal_output_verbose(10, ompi_cr_output,
|
opal_output_verbose(10, ompi_cr_output,
|
||||||
"ompi_cr: coord_pre_continue: ompi_cr_coord_pre_continue()");
|
"ompi_cr: coord_pre_continue: ompi_cr_coord_pre_continue()");
|
||||||
|
|
||||||
if( ompi_cr_continue_like_restart ) {
|
if( orte_cr_continue_like_restart ) {
|
||||||
/* Mimic ompi_cr_coord_pre_restart(); */
|
/* Mimic ompi_cr_coord_pre_restart(); */
|
||||||
if( ORTE_SUCCESS != (ret = mca_pml.pml_ft_event(OPAL_CRS_CONTINUE))) {
|
if( ORTE_SUCCESS != (ret = mca_pml.pml_ft_event(OPAL_CRS_CONTINUE))) {
|
||||||
exit_status = ret;
|
exit_status = ret;
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
* University Research and Technology
|
* University Research and Technology
|
||||||
* Corporation. All rights reserved.
|
* Corporation. All rights reserved.
|
||||||
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||||
@ -26,6 +26,7 @@
|
|||||||
#define OMPI_CR_H
|
#define OMPI_CR_H
|
||||||
|
|
||||||
#include "ompi_config.h"
|
#include "ompi_config.h"
|
||||||
|
#include "orte/runtime/orte_cr.h"
|
||||||
|
|
||||||
BEGIN_C_DECLS
|
BEGIN_C_DECLS
|
||||||
|
|
||||||
@ -49,11 +50,13 @@ BEGIN_C_DECLS
|
|||||||
*/
|
*/
|
||||||
OMPI_DECLSPEC extern int ompi_cr_output;
|
OMPI_DECLSPEC extern int ompi_cr_output;
|
||||||
|
|
||||||
/*
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
* If one of the BTLs that shutdown require a full, clean rebuild of the
|
OMPI_DECLSPEC extern int MPIR_checkpointable;
|
||||||
* point-to-point stack on 'continue' as well as 'restart'.
|
OMPI_DECLSPEC extern char * MPIR_controller_hostname;
|
||||||
*/
|
OMPI_DECLSPEC extern char * MPIR_checkpoint_command;
|
||||||
OPAL_DECLSPEC extern bool ompi_cr_continue_like_restart;
|
OMPI_DECLSPEC extern char * MPIR_restart_command;
|
||||||
|
OMPI_DECLSPEC extern char * MPIR_checkpoint_listing_command;
|
||||||
|
#endif
|
||||||
|
|
||||||
END_C_DECLS
|
END_C_DECLS
|
||||||
|
|
||||||
|
@ -51,6 +51,8 @@
|
|||||||
#if OPAL_ENABLE_FT_CR == 1
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
#include "opal/mca/crs/crs.h"
|
#include "opal/mca/crs/crs.h"
|
||||||
#include "opal/mca/crs/base/base.h"
|
#include "opal/mca/crs/base/base.h"
|
||||||
|
#include "opal/mca/compress/compress.h"
|
||||||
|
#include "opal/mca/compress/base/base.h"
|
||||||
#endif
|
#endif
|
||||||
#include "opal/runtime/opal.h"
|
#include "opal/runtime/opal.h"
|
||||||
#include "opal/dss/dss.h"
|
#include "opal/dss/dss.h"
|
||||||
@ -114,6 +116,8 @@
|
|||||||
#if OPAL_ENABLE_FT_CR == 1
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
#include "orte/mca/snapc/snapc.h"
|
#include "orte/mca/snapc/snapc.h"
|
||||||
#include "orte/mca/snapc/base/base.h"
|
#include "orte/mca/snapc/base/base.h"
|
||||||
|
#include "orte/mca/sstore/sstore.h"
|
||||||
|
#include "orte/mca/sstore/base/base.h"
|
||||||
#endif
|
#endif
|
||||||
#if ORTE_ENABLE_SENSORS
|
#if ORTE_ENABLE_SENSORS
|
||||||
#include "orte/mca/sensor/sensor.h"
|
#include "orte/mca/sensor/sensor.h"
|
||||||
@ -330,6 +334,14 @@ void ompi_info_open_components(void)
|
|||||||
map->type = strdup("crs");
|
map->type = strdup("crs");
|
||||||
map->components = &opal_crs_base_components_available;
|
map->components = &opal_crs_base_components_available;
|
||||||
opal_pointer_array_add(&component_map, map);
|
opal_pointer_array_add(&component_map, map);
|
||||||
|
|
||||||
|
if (OPAL_SUCCESS != opal_compress_base_open()) {
|
||||||
|
goto error;
|
||||||
|
}
|
||||||
|
map = OBJ_NEW(ompi_info_component_map_t);
|
||||||
|
map->type = strdup("compress");
|
||||||
|
map->components = &opal_compress_base_components_available;
|
||||||
|
opal_pointer_array_add(&component_map, map);
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
/* OPAL's installdirs base open has already been called as part of
|
/* OPAL's installdirs base open has already been called as part of
|
||||||
@ -460,6 +472,14 @@ void ompi_info_open_components(void)
|
|||||||
opal_pointer_array_add(&component_map, map);
|
opal_pointer_array_add(&component_map, map);
|
||||||
|
|
||||||
#if OPAL_ENABLE_FT_CR == 1
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
|
if (ORTE_SUCCESS != orte_sstore_base_open()) {
|
||||||
|
goto error;
|
||||||
|
}
|
||||||
|
map = OBJ_NEW(ompi_info_component_map_t);
|
||||||
|
map->type = strdup("sstore");
|
||||||
|
map->components = &orte_sstore_base_components_available;
|
||||||
|
opal_pointer_array_add(&component_map, map);
|
||||||
|
|
||||||
if (ORTE_SUCCESS != orte_snapc_base_open()) {
|
if (ORTE_SUCCESS != orte_snapc_base_open()) {
|
||||||
goto error;
|
goto error;
|
||||||
}
|
}
|
||||||
@ -680,6 +700,7 @@ void ompi_info_close_components()
|
|||||||
#if !ORTE_DISABLE_FULL_SUPPORT
|
#if !ORTE_DISABLE_FULL_SUPPORT
|
||||||
#if OPAL_ENABLE_FT_CR == 1
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
(void) orte_snapc_base_close();
|
(void) orte_snapc_base_close();
|
||||||
|
(void) orte_sstore_base_close();
|
||||||
#endif
|
#endif
|
||||||
(void) orte_filem_base_close();
|
(void) orte_filem_base_close();
|
||||||
(void) orte_iof_base_close();
|
(void) orte_iof_base_close();
|
||||||
|
@ -37,6 +37,9 @@
|
|||||||
#include "opal/class/opal_object.h"
|
#include "opal/class/opal_object.h"
|
||||||
#include "opal/class/opal_pointer_array.h"
|
#include "opal/class/opal_pointer_array.h"
|
||||||
#include "opal/runtime/opal.h"
|
#include "opal/runtime/opal.h"
|
||||||
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
|
#include "opal/runtime/opal_cr.h"
|
||||||
|
#endif
|
||||||
#include "opal/util/cmd_line.h"
|
#include "opal/util/cmd_line.h"
|
||||||
#include "opal/util/argv.h"
|
#include "opal/util/argv.h"
|
||||||
#include "opal/mca/base/base.h"
|
#include "opal/mca/base/base.h"
|
||||||
@ -196,7 +199,9 @@ int main(int argc, char *argv[])
|
|||||||
opal_pointer_array_add(&mca_types, "installdirs");
|
opal_pointer_array_add(&mca_types, "installdirs");
|
||||||
opal_pointer_array_add(&mca_types, "sysinfo");
|
opal_pointer_array_add(&mca_types, "sysinfo");
|
||||||
#if OPAL_ENABLE_FT_CR == 1
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
|
opal_cr_set_enabled(true);
|
||||||
opal_pointer_array_add(&mca_types, "crs");
|
opal_pointer_array_add(&mca_types, "crs");
|
||||||
|
opal_pointer_array_add(&mca_types, "compress");
|
||||||
#endif
|
#endif
|
||||||
opal_pointer_array_add(&mca_types, "dpm");
|
opal_pointer_array_add(&mca_types, "dpm");
|
||||||
opal_pointer_array_add(&mca_types, "pubsub");
|
opal_pointer_array_add(&mca_types, "pubsub");
|
||||||
@ -228,6 +233,7 @@ int main(int argc, char *argv[])
|
|||||||
opal_pointer_array_add(&mca_types, "routed");
|
opal_pointer_array_add(&mca_types, "routed");
|
||||||
opal_pointer_array_add(&mca_types, "plm");
|
opal_pointer_array_add(&mca_types, "plm");
|
||||||
#if OPAL_ENABLE_FT_CR == 1
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
|
opal_pointer_array_add(&mca_types, "sstore");
|
||||||
opal_pointer_array_add(&mca_types, "snapc");
|
opal_pointer_array_add(&mca_types, "snapc");
|
||||||
#endif
|
#endif
|
||||||
#if ORTE_ENABLE_SENSORS
|
#if ORTE_ENABLE_SENSORS
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
* University Research and Technology
|
* University Research and Technology
|
||||||
* Corporation. All rights reserved.
|
* Corporation. All rights reserved.
|
||||||
* Copyright (c) 2004-2006 The University of Tennessee and The University
|
* Copyright (c) 2004-2006 The University of Tennessee and The University
|
||||||
@ -515,6 +515,7 @@ void ompi_info_do_config(bool want_all)
|
|||||||
char *wtime_support;
|
char *wtime_support;
|
||||||
char *symbol_visibility;
|
char *symbol_visibility;
|
||||||
char *ft_support;
|
char *ft_support;
|
||||||
|
char *crdebug_support;
|
||||||
/* Do a little preprocessor trickery here to figure ompi_info_out the
|
/* Do a little preprocessor trickery here to figure ompi_info_out the
|
||||||
* tri-state of MPI_PARAM_CHECK (which will be either 0, 1, or
|
* tri-state of MPI_PARAM_CHECK (which will be either 0, 1, or
|
||||||
* ompi_mpi_param_check). The preprocessor will only allow
|
* ompi_mpi_param_check). The preprocessor will only allow
|
||||||
@ -583,6 +584,9 @@ void ompi_info_do_config(bool want_all)
|
|||||||
asprintf(&ft_support, "%s (checkpoint thread: %s)",
|
asprintf(&ft_support, "%s (checkpoint thread: %s)",
|
||||||
OPAL_ENABLE_FT ? "yes" : "no", OPAL_ENABLE_FT_THREAD ? "yes" : "no");;
|
OPAL_ENABLE_FT ? "yes" : "no", OPAL_ENABLE_FT_THREAD ? "yes" : "no");;
|
||||||
|
|
||||||
|
asprintf(&crdebug_support, "%s",
|
||||||
|
OPAL_ENABLE_CRDEBUG ? "yes" : "no");
|
||||||
|
|
||||||
/* output values */
|
/* output values */
|
||||||
ompi_info_out("Configured by", "config:user", OMPI_CONFIGURE_USER);
|
ompi_info_out("Configured by", "config:user", OMPI_CONFIGURE_USER);
|
||||||
ompi_info_out("Configured on", "config:timestamp", OMPI_CONFIGURE_DATE);
|
ompi_info_out("Configured on", "config:timestamp", OMPI_CONFIGURE_DATE);
|
||||||
@ -834,6 +838,9 @@ void ompi_info_do_config(bool want_all)
|
|||||||
ompi_info_out("FT Checkpoint support", "options:ft_support", ft_support);
|
ompi_info_out("FT Checkpoint support", "options:ft_support", ft_support);
|
||||||
free(ft_support);
|
free(ft_support);
|
||||||
|
|
||||||
|
ompi_info_out("C/R Enabled Debugging", "options:crdebug_support", crdebug_support);
|
||||||
|
free(crdebug_support);
|
||||||
|
|
||||||
ompi_info_out_int("MPI_MAX_PROCESSOR_NAME", "options:mpi-max-processor-name",
|
ompi_info_out_int("MPI_MAX_PROCESSOR_NAME", "options:mpi-max-processor-name",
|
||||||
MPI_MAX_PROCESSOR_NAME);
|
MPI_MAX_PROCESSOR_NAME);
|
||||||
ompi_info_out_int("MPI_MAX_ERROR_STRING", "options:mpi-max-error-string",
|
ompi_info_out_int("MPI_MAX_ERROR_STRING", "options:mpi-max-error-string",
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
#
|
#
|
||||||
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
|
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
# University Research and Technology
|
# University Research and Technology
|
||||||
# Corporation. All rights reserved.
|
# Corporation. All rights reserved.
|
||||||
# Copyright (c) 2004-2005 The University of Tennessee and The University
|
# Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||||
@ -39,6 +39,7 @@ install-exec-hook:
|
|||||||
if WANT_FT
|
if WANT_FT
|
||||||
(cd $(DESTDIR)$(bindir); rm -f ompi-checkpoint$(EXEEXT); $(LN_S) orte-checkpoint$(EXEEXT) ompi-checkpoint$(EXEEXT))
|
(cd $(DESTDIR)$(bindir); rm -f ompi-checkpoint$(EXEEXT); $(LN_S) orte-checkpoint$(EXEEXT) ompi-checkpoint$(EXEEXT))
|
||||||
(cd $(DESTDIR)$(bindir); rm -f ompi-restart$(EXEEXT); $(LN_S) orte-restart$(EXEEXT) ompi-restart$(EXEEXT))
|
(cd $(DESTDIR)$(bindir); rm -f ompi-restart$(EXEEXT); $(LN_S) orte-restart$(EXEEXT) ompi-restart$(EXEEXT))
|
||||||
|
(cd $(DESTDIR)$(bindir); rm -f ompi-migrate$(EXEEXT); $(LN_S) orte-migrate$(EXEEXT) ompi-migrate$(EXEEXT))
|
||||||
endif
|
endif
|
||||||
|
|
||||||
uninstall-local:
|
uninstall-local:
|
||||||
@ -50,7 +51,8 @@ uninstall-local:
|
|||||||
$(DESTDIR)$(bindir)/ompi-top$(EXEEXT)
|
$(DESTDIR)$(bindir)/ompi-top$(EXEEXT)
|
||||||
if WANT_FT
|
if WANT_FT
|
||||||
rm -f $(DESTDIR)$(bindir)/ompi-checkpoint$(EXEEXT) \
|
rm -f $(DESTDIR)$(bindir)/ompi-checkpoint$(EXEEXT) \
|
||||||
$(DESTDIR)$(bindir)/ompi-restart$(EXEEXT)
|
$(DESTDIR)$(bindir)/ompi-restart$(EXEEXT) \
|
||||||
|
$(DESTDIR)$(bindir)/ompi-migrate$(EXEEXT)
|
||||||
endif
|
endif
|
||||||
|
|
||||||
endif # !ORTE_DISABLE_FULL_SUPPORT
|
endif # !ORTE_DISABLE_FULL_SUPPORT
|
||||||
@ -95,6 +97,12 @@ $(top_builddir)/orte/tools/orte-restart/orte-restart.1:
|
|||||||
ompi-restart.1: $(top_builddir)/orte/tools/orte-restart/orte-restart.1
|
ompi-restart.1: $(top_builddir)/orte/tools/orte-restart/orte-restart.1
|
||||||
cp -f $(top_builddir)/orte/tools/orte-restart/orte-restart.1 ompi-restart.1
|
cp -f $(top_builddir)/orte/tools/orte-restart/orte-restart.1 ompi-restart.1
|
||||||
|
|
||||||
|
$(top_builddir)/orte/tools/orte-migrate/orte-migrate.1:
|
||||||
|
(cd $(top_builddir)/orte/tools/orte-migrate && $(MAKE) $(AM_MAKEFLAGS) orte-migrate.1)
|
||||||
|
|
||||||
|
ompi-migrate.1: $(top_builddir)/orte/tools/orte-migrate/orte-migrate.1
|
||||||
|
cp -f $(top_builddir)/orte/tools/orte-migrate/orte-migrate.1 ompi-migrate.1
|
||||||
|
|
||||||
$(top_builddir)/orte/tools/orte-top/orte-top.1:
|
$(top_builddir)/orte/tools/orte-top/orte-top.1:
|
||||||
(cd $(top_builddir)/orte/tools/orte-top && $(MAKE) $(AM_MAKEFLAGS) orte-top.1)
|
(cd $(top_builddir)/orte/tools/orte-top && $(MAKE) $(AM_MAKEFLAGS) orte-top.1)
|
||||||
|
|
||||||
|
@ -541,4 +541,27 @@ OPAL_WITH_OPTION_MIN_MAX_VALUE(datarep_string, 128, 64, 256)
|
|||||||
AC_ARG_WITH([libltdl],
|
AC_ARG_WITH([libltdl],
|
||||||
[AC_HELP_STRING([--with-libltdl(=DIR)],
|
[AC_HELP_STRING([--with-libltdl(=DIR)],
|
||||||
[Where to find libltdl (this option is ignored if --disable-dlopen is used). DIR can take one of three values: "internal", "external", or a valid directory name. "internal" (or no DIR value) forces Open MPI to use its internal copy of libltdl. "external" forces Open MPI to use an external installation of libltdl. Supplying a valid directory name also forces Open MPI to use an external installation of libltdl, and adds DIR/include, DIR/lib, and DIR/lib64 to the search path for headers and libraries.])])
|
[Where to find libltdl (this option is ignored if --disable-dlopen is used). DIR can take one of three values: "internal", "external", or a valid directory name. "internal" (or no DIR value) forces Open MPI to use its internal copy of libltdl. "external" forces Open MPI to use an external installation of libltdl. Supplying a valid directory name also forces Open MPI to use an external installation of libltdl, and adds DIR/include, DIR/lib, and DIR/lib64 to the search path for headers and libraries.])])
|
||||||
|
|
||||||
|
#
|
||||||
|
# Checkpoint/restart enabled debugging
|
||||||
|
#
|
||||||
|
AC_MSG_CHECKING([if want checkpoint/restart enabled debugging option])
|
||||||
|
AC_ARG_ENABLE([crdebug],
|
||||||
|
[AC_HELP_STRING([--enable-crdebug],
|
||||||
|
[enable checkpoint/restart debugging functionality (default: disabled)])])
|
||||||
|
|
||||||
|
if test "$ompi_want_ft" = "0"; then
|
||||||
|
ompi_want_prd=0
|
||||||
|
AC_MSG_RESULT([Disabled (fault tolerance disabled --without-ft)])
|
||||||
|
elif test "$enable_crdebug" = "yes"; then
|
||||||
|
ompi_want_prd=1
|
||||||
|
AC_MSG_RESULT([Enabled])
|
||||||
|
else
|
||||||
|
ompi_want_prd=0
|
||||||
|
AC_MSG_RESULT([Disabled])
|
||||||
|
fi
|
||||||
|
|
||||||
|
AC_DEFINE_UNQUOTED([OPAL_ENABLE_CRDEBUG], [$ompi_want_prd],
|
||||||
|
[Whether we want checkpoint/restart enabled debugging functionality or not])
|
||||||
|
|
||||||
])dnl
|
])dnl
|
||||||
|
42
opal/mca/compress/Makefile.am
Normal file
@ -0,0 +1,42 @@
#
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation. All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

include $(top_srcdir)/Makefile.man-page-rules

# main library setup
noinst_LTLIBRARIES = libmca_compress.la
libmca_compress_la_SOURCES =

# header setup
nobase_opal_HEADERS =

# local files
headers = compress.h
libmca_compress_la_SOURCES += $(headers)

# Ensure that the man pages are rebuilt if the opal_config.h file
# changes; a "good enough" way to know if configure was run again (and
# therefore the release date or version may have changed)
$(nodist_man_MANS): $(top_builddir)/opal/include/opal_config.h

# Conditionally install the header files
if WANT_INSTALL_HEADERS
nobase_opal_HEADERS += $(headers)
opaldir = $(includedir)/openmpi/opal/mca/compress
else
opaldir = $(includedir)
endif

include base/Makefile.am

distclean-local:
	rm -f base/static-components.h
	rm -f $(nodist_man_MANS)
21
opal/mca/compress/base/Makefile.am
Normal file
@ -0,0 +1,21 @
#
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation.  All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

dist_pkgdata_DATA = base/help-opal-compress-base.txt

headers += \
        base/base.h

libmca_compress_la_SOURCES += \
        base/compress_base_open.c \
        base/compress_base_close.c \
        base/compress_base_select.c \
        base/compress_base_fns.c
76
opal/mca/compress/base/base.h
Normal file
@ -0,0 +1,76 @
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#ifndef OPAL_COMPRESS_BASE_H
#define OPAL_COMPRESS_BASE_H

#include "opal_config.h"
#include "opal/mca/compress/compress.h"
#include "opal/util/opal_environ.h"
#include "opal/runtime/opal_cr.h"

/*
 * Global functions for MCA overall COMPRESS
 */

#if defined(c_plusplus) || defined(__cplusplus)
extern "C" {
#endif

    /**
     * Initialize the COMPRESS MCA framework
     *
     * @retval OPAL_SUCCESS Upon success
     * @retval OPAL_ERROR   Upon failures
     *
     * This function is invoked during opal_init();
     */
    OPAL_DECLSPEC int opal_compress_base_open(void);

    /**
     * Select an available component.
     *
     * @retval OPAL_SUCCESS Upon Success
     * @retval OPAL_NOT_FOUND If no component can be selected
     * @retval OPAL_ERROR Upon other failure
     *
     */
    OPAL_DECLSPEC int opal_compress_base_select(void);

    /**
     * Finalize the COMPRESS MCA framework
     *
     * @retval OPAL_SUCCESS Upon success
     * @retval OPAL_ERROR   Upon failures
     *
     * This function is invoked during opal_finalize();
     */
    OPAL_DECLSPEC int opal_compress_base_close(void);

    /**
     * Globals
     */
    OPAL_DECLSPEC extern int  opal_compress_base_output;
    OPAL_DECLSPEC extern opal_list_t opal_compress_base_components_available;
    OPAL_DECLSPEC extern opal_compress_base_component_t opal_compress_base_selected_component;
    OPAL_DECLSPEC extern opal_compress_base_module_t opal_compress;

    /**
     * Tar helpers: create "<target>.tar" from *target, or extract *target; on success *target is updated to the resulting name
     */
    OPAL_DECLSPEC int opal_compress_base_tar_create(char ** target);
    OPAL_DECLSPEC int opal_compress_base_tar_extract(char ** target);

#if defined(c_plusplus) || defined(__cplusplus)
}
#endif

#endif /* OPAL_COMPRESS_BASE_H */
40
opal/mca/compress/base/compress_base_close.c
Normal file
@ -0,0 +1,40 @
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "opal_config.h"

#include <string.h>
#include "opal/mca/mca.h"
#include "opal/mca/base/base.h"
#include "opal/include/opal/constants.h"
#include "opal/mca/compress/compress.h"
#include "opal/mca/compress/base/base.h"

int opal_compress_base_close(void)
{
    /* Compression currently only used with C/R */
    if( !opal_cr_is_enabled ) {
        opal_output_verbose(10, opal_compress_base_output,
                            "compress:close: FT is not enabled, skipping!");
        return OPAL_SUCCESS;
    }

    /* Call the component's finalize routine */
    if( NULL != opal_compress.finalize ) {
        opal_compress.finalize();
    }

    /* Close all available modules that are open */
    mca_base_components_close(opal_compress_base_output,
                              &opal_compress_base_components_available,
                              NULL);

    return OPAL_SUCCESS;
}
142
opal/mca/compress/base/compress_base_fns.c
Normal file
@ -0,0 +1,142 @
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 *                         All rights reserved.
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "opal_config.h"

#include <string.h>
#include <sys/wait.h>
#if HAVE_SYS_TYPES_H
#include <sys/types.h>
#endif
#if HAVE_UNISTD_H
#include <unistd.h>
#endif
#ifdef HAVE_FCNTL_H
#include <fcntl.h>
#endif  /* HAVE_FCNTL_H */
#ifdef HAVE_SYS_STAT_H
#include <sys/stat.h>
#endif

#include "opal/mca/mca.h"
#include "opal/mca/base/base.h"
#include "opal/include/opal/constants.h"
#include "opal/util/os_dirpath.h"
#include "opal/util/output.h"
#include "opal/util/argv.h"

#include "opal/mca/compress/compress.h"
#include "opal/mca/compress/base/base.h"

/******************
 * Local Function Defs
 ******************/

/******************
 * Object stuff
 ******************/

int opal_compress_base_tar_create(char ** target)
{
    int exit_status = OPAL_SUCCESS;
    char *cmd = NULL;
    char *tar_target = NULL;
    char **argv = NULL;
    pid_t child_pid = 0;
    int status = 0;

    asprintf(&tar_target, "%s.tar", *target);

    child_pid = fork();
    if( 0 == child_pid ) { /* Child */
        asprintf(&cmd, "tar -cf %s %s", tar_target, *target);

        argv = opal_argv_split(cmd, ' ');
        status = execvp(argv[0], argv);

        opal_output(0, "compress:base: Tar:: Failed to exec child [%s] status = %d\n", cmd, status);
        exit(OPAL_ERROR);
    }
    else if(0 < child_pid) {
        waitpid(child_pid, &status, 0);

        if( !WIFEXITED(status) ) {
            exit_status = OPAL_ERROR;
            goto cleanup;
        }

        free(*target);
        *target = strdup(tar_target);
    }
    else {
        exit_status = OPAL_ERROR;
        goto cleanup;
    }

 cleanup:
    if( NULL != cmd ) {
        free(cmd);
        cmd = NULL;
    }
    if( NULL != tar_target ) {
        free(tar_target);
        tar_target = NULL;
    }

    return exit_status;
}

int opal_compress_base_tar_extract(char ** target)
{
    int exit_status = OPAL_SUCCESS;
    char *cmd = NULL;
    char **argv = NULL;
    pid_t child_pid = 0;
    int status = 0;

    child_pid = fork();
    if( 0 == child_pid ) { /* Child */
        asprintf(&cmd, "tar -xf %s", *target);

        argv = opal_argv_split(cmd, ' ');
        status = execvp(argv[0], argv);

        opal_output(0, "compress:base: Tar:: Failed to exec child [%s] status = %d\n", cmd, status);
        exit(OPAL_ERROR);
    }
    else if(0 < child_pid) {
        waitpid(child_pid, &status, 0);

        if( !WIFEXITED(status) ) {
            exit_status = OPAL_ERROR;
            goto cleanup;
        }

        /* Strip off the '.tar' */
        (*target)[strlen(*target)-4] = '\0';
    }
    else {
        exit_status = OPAL_ERROR;
        goto cleanup;
    }

 cleanup:
    if( NULL != cmd ) {
        free(cmd);
        cmd = NULL;
    }

    return exit_status;
}

/******************
 * Local Functions
 ******************/
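Note on the tar helpers in compress_base_fns.c: they build one command string, split it on spaces, and accept any normal child exit as success. A path containing a space would therefore be split into several arguments, and a child that exits with a non-zero status would still be reported as OPAL_SUCCESS. The sketch below is not part of the commit; `run_and_wait` is a hypothetical helper name, and it assumes a POSIX system. It passes the argument vector directly and also checks the exit status.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not in the commit): fork, exec argv[0] with the
 * given argument vector, and wait for the child.  Returns 0 only when
 * the child exited normally with status 0, and -1 in every other case.
 */
static int run_and_wait(char *const argv[])
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;                  /* fork failed */
    }
    if (0 == pid) {                 /* child */
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    /* parent: a failed waitpid is treated as an error as well */
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    if (!WIFEXITED(status) || 0 != WEXITSTATUS(status)) {
        return -1;
    }
    return 0;
}
```

A caller could build the vector as `{ "tar", "-cf", tar_target, target, NULL }`, which keeps each path as a single argument regardless of embedded spaces.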
99
opal/mca/compress/base/compress_base_open.c
Normal file
@ -0,0 +1,99 @
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 *                         All rights reserved.
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "opal_config.h"

#include <string.h>
#include "opal/mca/mca.h"
#include "opal/mca/base/base.h"
#include "opal/include/opal/constants.h"
#include "opal/mca/compress/compress.h"
#include "opal/mca/compress/base/base.h"
#include "opal/util/output.h"

#include "opal/mca/compress/base/static-components.h"

/*
 * Globals
 */
int  opal_compress_base_output = -1;
opal_compress_base_module_t opal_compress = {
    NULL, /* init             */
    NULL, /* finalize         */
    NULL, /* compress         */
    NULL, /* compress_nb      */
    NULL, /* decompress       */
    NULL  /* decompress_nb    */
};
opal_list_t opal_compress_base_components_available;
opal_compress_base_component_t opal_compress_base_selected_component;

/**
 * Function for finding and opening either all MCA components,
 * or the one that was specifically requested via a MCA parameter.
 */
int opal_compress_base_open(void)
{
    int ret, exit_status = OPAL_SUCCESS;
    int value;
    char *str_value = NULL;

    /* Debugging/Verbose output */
    mca_base_param_reg_int_name("compress",
                                "base_verbose",
                                "Verbosity level of the COMPRESS framework",
                                false, false,
                                0, &value);
    if(0 != value) {
        opal_compress_base_output = opal_output_open(NULL);
    } else {
        opal_compress_base_output = -1;
    }
    opal_output_set_verbosity(opal_compress_base_output, value);

    /*
     * Which COMPRESS component to open
     *  - NULL or "" = auto-select
     *  - "none" = Empty component
     *  - otherwise, select that specific component
     */
    mca_base_param_reg_string_name("compress", NULL,
                                   "Which COMPRESS component to use (empty = auto-select)",
                                   false, false,
                                   NULL, &str_value);

    /* Compression currently only used with C/R */
    if( !opal_cr_is_enabled ) {
        opal_output_verbose(10, opal_compress_base_output,
                            "compress:open: FT is not enabled, skipping!");
        return OPAL_SUCCESS;
    }

    /* Open up all available components */
    if (OPAL_SUCCESS != (ret = mca_base_components_open("compress",
                                                        opal_compress_base_output,
                                                        mca_compress_base_static_components,
                                                        &opal_compress_base_components_available,
                                                        true)) ) {
        if( OPAL_ERR_NOT_FOUND == ret &&
            NULL != str_value &&
            0 == strncmp(str_value, "none", strlen("none")) ) {
            exit_status = OPAL_SUCCESS;
        } else {
            exit_status = OPAL_ERROR;
        }
    }

    if( NULL != str_value ) {
        free(str_value);
    }
    return exit_status;
}
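Note on the module table in compress_base_open.c: `opal_compress` starts out with every entry set to NULL and is only filled in once a component is selected, which is why the close path tests `finalize` against NULL before calling it. The sketch below models that pattern in isolation. All names in it (`sketch_module`, `sketch_close`, `sketch_finalize`) are hypothetical and it does not use the OPAL headers.

```c
#include <stddef.h>

/* Hypothetical, reduced stand-in for a module: a table of callbacks. */
typedef int (*sketch_fn_t)(void);

struct sketch_module {
    sketch_fn_t init;
    sketch_fn_t finalize;
};

/* Counts how often the finalize callback below has run. */
static int sketch_finalize_calls = 0;

static int sketch_finalize(void)
{
    ++sketch_finalize_calls;
    return 0;
}

/* Call finalize only when a component has actually installed one. */
static int sketch_close(const struct sketch_module *module)
{
    if (NULL != module->finalize) {
        return module->finalize();
    }
    return 0;
}
```

With an all-NULL table `sketch_close` is a no-op, so the close path stays safe when no component was ever selected.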
65
opal/mca/compress/base/compress_base_select.c
Normal file
@ -0,0 +1,65 @
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 *                         All rights reserved.
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "opal_config.h"

#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif

#include "opal/include/opal/constants.h"
#include "opal/util/output.h"
#include "opal/mca/mca.h"
#include "opal/mca/base/base.h"
#include "opal/mca/base/mca_base_param.h"
#include "opal/mca/compress/compress.h"
#include "opal/mca/compress/base/base.h"

int opal_compress_base_select(void)
{
    int ret, exit_status = OPAL_SUCCESS;
    opal_compress_base_component_t *best_component = NULL;
    opal_compress_base_module_t *best_module = NULL;

    /* Compression currently only used with C/R */
    if( !opal_cr_is_enabled ) {
        opal_output_verbose(10, opal_compress_base_output,
                            "compress:select: FT is not enabled, skipping!");
        return OPAL_SUCCESS;
    }

    /*
     * Select the best component
     */
    if( OPAL_SUCCESS != mca_base_select("compress", opal_compress_base_output,
                                        &opal_compress_base_components_available,
                                        (mca_base_module_t **) &best_module,
                                        (mca_base_component_t **) &best_component) ) {
        /* This will only happen if no component was selected */
        exit_status = OPAL_ERROR;
        goto cleanup;
    }

    /* Save the winner */
    opal_compress_base_selected_component = *best_component;
    opal_compress = *best_module;

    /* Initialize the winner */
    if (NULL != best_module) {
        if (OPAL_SUCCESS != (ret = opal_compress.init()) ) {
            exit_status = ret;
            goto cleanup;
        }
    }

 cleanup:
    return exit_status;
}
13
opal/mca/compress/base/help-opal-compress-base.txt
Normal file
@ -0,0 +1,13 @
 -*- text -*-
#
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation.  All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English general help file for Open PAL Compress framework.
#
40
opal/mca/compress/bzip/Makefile.am
Normal file
@ -0,0 +1,40 @
#
# Copyright (c) 2004-2010 The Trustees of Indiana University.
#                         All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

AM_CPPFLAGS = \
        $(LTDLINCL)

dist_pkgdata_DATA = help-opal-compress-bzip.txt

sources = \
        compress_bzip.h \
        compress_bzip_component.c \
        compress_bzip_module.c

# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).

if OMPI_BUILD_compress_bzip_DSO
component_noinst =
component_install = mca_compress_bzip.la
else
component_noinst = libmca_compress_bzip.la
component_install =
endif

mcacomponentdir = $(pkglibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_compress_bzip_la_SOURCES = $(sources)
mca_compress_bzip_la_LDFLAGS = -module -avoid-version

noinst_LTLIBRARIES = $(component_noinst)
libmca_compress_bzip_la_SOURCES = $(sources)
libmca_compress_bzip_la_LDFLAGS = -module -avoid-version
63
opal/mca/compress/bzip/compress_bzip.h
Normal file
@ -0,0 +1,63 @
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/**
 * @file
 *
 * BZIP COMPRESS component
 *
 * Uses the bzip library
 */

#ifndef MCA_COMPRESS_BZIP_EXPORT_H
#define MCA_COMPRESS_BZIP_EXPORT_H

#include "opal_config.h"

#include "opal/util/output.h"

#include "opal/mca/mca.h"
#include "opal/mca/compress/compress.h"

#if defined(c_plusplus) || defined(__cplusplus)
extern "C" {
#endif

    /*
     * Local Component structures
     */
    struct opal_compress_bzip_component_t {
        opal_compress_base_component_t super;  /** Base COMPRESS component */

    };
    typedef struct opal_compress_bzip_component_t opal_compress_bzip_component_t;
    OPAL_MODULE_DECLSPEC extern opal_compress_bzip_component_t mca_compress_bzip_component;

    int opal_compress_bzip_component_query(mca_base_module_t **module, int *priority);

    /*
     * Module functions
     */
    int opal_compress_bzip_module_init(void);
    int opal_compress_bzip_module_finalize(void);

    /*
     * Actual functionality
     */
    int opal_compress_bzip_compress(char *fname, char **cname, char **postfix);
    int opal_compress_bzip_compress_nb(char *fname, char **cname, char **postfix, pid_t *child_pid);
    int opal_compress_bzip_decompress(char *cname, char **fname);
    int opal_compress_bzip_decompress_nb(char *cname, char **fname, pid_t *child_pid);

#if defined(c_plusplus) || defined(__cplusplus)
}
#endif

#endif /* MCA_COMPRESS_BZIP_EXPORT_H */
138
opal/mca/compress/bzip/compress_bzip_component.c
Normal file
@ -0,0 +1,138 @
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "opal_config.h"

#include "opal/constants.h"
#include "opal/mca/compress/compress.h"
#include "opal/mca/compress/base/base.h"
#include "compress_bzip.h"

/*
 * Public string for version number
 */
const char *opal_compress_bzip_component_version_string =
"OPAL COMPRESS bzip MCA component version " OPAL_VERSION;

/*
 * Local functionality
 */
static int compress_bzip_open(void);
static int compress_bzip_close(void);

/*
 * Instantiate the public struct with all of our public information
 * and pointer to our public functions in it
 */
opal_compress_bzip_component_t mca_compress_bzip_component = {
    /* First do the base component stuff */
    {
        /* Handle the general mca_component_t struct containing
         *  meta information about the component itself
         */
        {
            OPAL_COMPRESS_BASE_VERSION_2_0_0,

            /* Component name and version */
            "bzip",
            OPAL_MAJOR_VERSION,
            OPAL_MINOR_VERSION,
            OPAL_RELEASE_VERSION,

            /* Component open and close functions */
            compress_bzip_open,
            compress_bzip_close,
            opal_compress_bzip_component_query
        },
        {
            /* The component is checkpoint ready */
            MCA_BASE_METADATA_PARAM_CHECKPOINT
        },

        /* Verbosity level */
        0,
        /* opal_output handler */
        -1,
        /* Default priority */
        10
    }
};

/*
 * Bzip module
 */
static opal_compress_base_module_t loc_module = {
    /** Initialization Function */
    opal_compress_bzip_module_init,
    /** Finalization Function */
    opal_compress_bzip_module_finalize,

    /** Compress Function */
    opal_compress_bzip_compress,
    opal_compress_bzip_compress_nb,

    /** Decompress Function */
    opal_compress_bzip_decompress,
    opal_compress_bzip_decompress_nb
};

static int compress_bzip_open(void)
{
    mca_base_param_reg_int(&mca_compress_bzip_component.super.base_version,
                           "priority",
                           "Priority of the COMPRESS bzip component",
                           false, false,
                           mca_compress_bzip_component.super.priority,
                           &mca_compress_bzip_component.super.priority);

    mca_base_param_reg_int(&mca_compress_bzip_component.super.base_version,
                           "verbose",
                           "Verbose level for the COMPRESS bzip component",
                           false, false,
                           mca_compress_bzip_component.super.verbose,
                           &mca_compress_bzip_component.super.verbose);
    /* If there is a custom verbose level for this component then use it,
     * otherwise take our parent's level and output channel
     */
    if ( 0 != mca_compress_bzip_component.super.verbose) {
        mca_compress_bzip_component.super.output_handle = opal_output_open(NULL);
        opal_output_set_verbosity(mca_compress_bzip_component.super.output_handle,
                                  mca_compress_bzip_component.super.verbose);
    } else {
        mca_compress_bzip_component.super.output_handle = opal_compress_base_output;
    }

    /*
     * Debug output
     */
    opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
                        "compress:bzip: open()");
    opal_output_verbose(20, mca_compress_bzip_component.super.output_handle,
                        "compress:bzip: open: priority = %d",
                        mca_compress_bzip_component.super.priority);
    opal_output_verbose(20, mca_compress_bzip_component.super.output_handle,
                        "compress:bzip: open: verbosity = %d",
                        mca_compress_bzip_component.super.verbose);
    return OPAL_SUCCESS;
}

static int compress_bzip_close(void)
{
    return OPAL_SUCCESS;
}

int opal_compress_bzip_component_query(mca_base_module_t **module, int *priority)
{
    *module   = (mca_base_module_t *)&loc_module;
    *priority = mca_compress_bzip_component.super.priority;

    return OPAL_SUCCESS;
}

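Note on component priority: the bzip component registers a default priority of 10 and reports it from its query function. The base selection step is assumed here to keep the component that reports the highest priority. The sketch below illustrates that rule on its own. `sketch_component`, `sketch_select`, and the second component name and its priority used in the usage note are hypothetical and not taken from the commit.

```c
#include <stddef.h>

/* Hypothetical stand-in for a component: a name and the priority it reports. */
struct sketch_component {
    const char *name;
    int priority;
};

/*
 * Return the entry with the highest priority, or NULL for an empty list.
 * When two entries tie, the earlier one is kept.
 */
static const struct sketch_component *
sketch_select(const struct sketch_component *list, size_t count)
{
    const struct sketch_component *best = NULL;
    size_t i;

    for (i = 0; i < count; ++i) {
        if (NULL == best || list[i].priority > best->priority) {
            best = &list[i];
        }
    }
    return best;
}
```

Given a list holding `{"bzip", 10}` and an invented `{"other", 15}`, the sketch returns the second entry; raising the bzip priority through its MCA parameter would change the outcome.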
247
opal/mca/compress/bzip/compress_bzip_module.c
Normal file
@ -0,0 +1,247 @
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 *                         All rights reserved.
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

#include "opal_config.h"

#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#if HAVE_UNISTD_H
#include <unistd.h>
#endif  /* HAVE_UNISTD_H */

#include "opal/util/opal_environ.h"
#include "opal/util/output.h"
#include "opal/util/show_help.h"
#include "opal/util/argv.h"
#include "opal/util/opal_environ.h"

#include "opal/constants.h"
#include "opal/mca/base/mca_base_param.h"
#include "opal/util/basename.h"

#include "opal/mca/compress/compress.h"
#include "opal/mca/compress/base/base.h"
#include "opal/runtime/opal_cr.h"

#include "compress_bzip.h"

static bool is_directory(char *fname );

int opal_compress_bzip_module_init(void)
{
    return OPAL_SUCCESS;
}

int opal_compress_bzip_module_finalize(void)
{
    return OPAL_SUCCESS;
}

int opal_compress_bzip_compress(char * fname, char **cname, char **postfix)
{
    pid_t child_pid = 0;
    int status = 0;

    opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
                        "compress:bzip: compress(%s)",
                        fname);

    opal_compress_bzip_compress_nb(fname, cname, postfix, &child_pid);
    waitpid(child_pid, &status, 0);

    if( WIFEXITED(status) ) {
        return OPAL_SUCCESS;
    } else {
        return OPAL_ERROR;
    }
}

int opal_compress_bzip_compress_nb(char * fname, char **cname, char **postfix, pid_t *child_pid)
{
    char * cmd = NULL;
    char **argv = NULL;
    char * base_fname = NULL;
    char * dir_fname = NULL;
    int status;
    bool is_dir;

    is_dir = is_directory(fname);

    *child_pid = fork();
    if( *child_pid == 0 ) { /* Child */

        dir_fname  = opal_dirname(fname);
        base_fname = opal_basename(fname);

        chdir(dir_fname);

        if( is_dir ) {
#if 0
            opal_compress_base_tar_create(&base_fname);
            asprintf(cname, "%s.bz2", base_fname);
            asprintf(&cmd, "bzip2 %s", base_fname);
#else
            asprintf(cname, "%s.tar.bz2", base_fname);
            asprintf(&cmd, "tar -jcf %s %s", *cname, base_fname);
#endif
        } else {
            asprintf(cname, "%s.bz2", base_fname);
            asprintf(&cmd, "bzip2 %s", base_fname);
        }

        opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
                            "compress:bzip: compress_nb(%s -> [%s])",
                            fname, *cname);
        opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
                            "compress:bzip: compress_nb() command [%s]",
                            cmd);

        argv = opal_argv_split(cmd, ' ');
        status = execvp(argv[0], argv);

        opal_output(0, "compress:bzip: compress_nb: Failed to exec child [%s] status = %d\n", cmd, status);
        exit(OPAL_ERROR);
    }
    else if( *child_pid > 0 ) {
        if( is_dir ) {
            *postfix = strdup(".tar.bz2");
        } else {
            *postfix = strdup(".bz2");
        }
        asprintf(cname, "%s%s", fname, *postfix);
    }
    else {
        return OPAL_ERROR;
    }

    if( NULL != cmd ) {
        free(cmd);
        cmd = NULL;
    }

    return OPAL_SUCCESS;
}

int opal_compress_bzip_decompress(char * cname, char **fname)
{
    pid_t child_pid = 0;
    int status = 0;

    opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
                        "compress:bzip: decompress(%s)",
                        cname);

    opal_compress_bzip_decompress_nb(cname, fname, &child_pid);
    waitpid(child_pid, &status, 0);

    if( WIFEXITED(status) ) {
        return OPAL_SUCCESS;
    } else {
        return OPAL_ERROR;
    }
}

int opal_compress_bzip_decompress_nb(char * cname, char **fname, pid_t *child_pid)
{
    char * cmd = NULL;
    char **argv = NULL;
    char * dir_cname = NULL;
    pid_t loc_pid = 0;
    int status;
    bool is_tar = false;

    if( 0 == strncmp(&(cname[strlen(cname)-8]), ".tar.bz2", strlen(".tar.bz2")) ) {
        is_tar = true;
    }

    *fname = strdup(cname);
    if( is_tar ) {
        (*fname)[strlen(cname)-8] = '\0';
    } else {
        (*fname)[strlen(cname)-4] = '\0';
    }

    opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
                        "compress:bzip: decompress_nb(%s -> [%s])",
                        cname, *fname);

    *child_pid = fork();
    if( *child_pid == 0 ) { /* Child */
        dir_cname  = opal_dirname(cname);

        chdir(dir_cname);

        /* Fork(bunzip) */
        loc_pid = fork();
        if( loc_pid == 0 ) { /* Child */
            asprintf(&cmd, "bunzip2 %s", cname);

            opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
                                "compress:bzip: decompress_nb() command [%s]",
                                cmd);

            argv = opal_argv_split(cmd, ' ');
            status = execvp(argv[0], argv);

            opal_output(0, "compress:bzip: decompress_nb: Failed to exec child [%s] status = %d\n", cmd, status);
            exit(OPAL_ERROR);
        }
        else if( loc_pid > 0 ) { /* Parent */
            waitpid(loc_pid, &status, 0);
            if( !WIFEXITED(status) ) {
                opal_output(0, "compress:bzip: decompress_nb: Failed to bunzip the file [%s] status = %d\n", cname, status);
                exit(OPAL_ERROR);
            }
        }
        else {
            exit(OPAL_ERROR);
        }

        /* tar_decompress */
        if( is_tar ) {
            /* Strip off '.bz2' leaving just '.tar' */
            cname[strlen(cname)-4] = '\0';
            opal_compress_base_tar_extract(&cname);
        }

        /* Once this child is done, then directly exit */
        exit(OPAL_SUCCESS);
    }
    else if( *child_pid > 0 ) {
        ;
    }
    else {
        return OPAL_ERROR;
    }

    if( NULL != cmd ) {
        free(cmd);
        cmd = NULL;
    }

    return OPAL_SUCCESS;
}

static bool is_directory(char *fname ) {
|
||||||
|
struct stat file_status;
|
||||||
|
int rc;
|
||||||
|
|
||||||
|
if(0 != (rc = stat(fname, &file_status) ) ) {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
if(S_ISDIR(file_status.st_mode)) {
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
|
||||||
|
return false;
|
||||||
|
}
|
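The suffix handling in decompress_nb() above indexes `cname[strlen(cname)-8]` directly and only assigns `is_tar` on a match. A minimal sketch of a more defensive variant follows. It is not the component's code: `has_suffix` and `strip_compress_suffix` are hypothetical helper names introduced only for this illustration, and the suffixes are passed in so the same helper would serve both the bzip and gzip components.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): true if 'name' ends with 'suffix'. */
static bool has_suffix(const char *name, const char *suffix)
{
    size_t nlen = strlen(name);
    size_t slen = strlen(suffix);

    /* Names shorter than the suffix are rejected, so the offset never underflows. */
    return nlen >= slen && 0 == strcmp(name + (nlen - slen), suffix);
}

/* Hypothetical helper: return a malloc'ed copy of 'cname' without its
 * compression suffix and report whether the payload is a tarball.
 * Returns NULL if 'cname' carries neither suffix or if allocation fails. */
static char *strip_compress_suffix(const char *cname, const char *tar_suffix,
                                   const char *plain_suffix, bool *is_tar)
{
    const char *suffix;
    size_t keep;
    char *fname;

    *is_tar = has_suffix(cname, tar_suffix);    /* assigned on every path */
    suffix  = *is_tar ? tar_suffix : plain_suffix;
    if (!has_suffix(cname, suffix)) {
        return NULL;
    }

    keep  = strlen(cname) - strlen(suffix);
    fname = malloc(keep + 1);
    if (NULL != fname) {
        memcpy(fname, cname, keep);
        fname[keep] = '\0';
    }
    return fname;
}
```

Under these assumptions, `strip_compress_suffix("ckpt.tar.bz2", ".tar.bz2", ".bz2", &is_tar)` yields `"ckpt"` with `is_tar` set, and a name without either suffix is reported as an error instead of being truncated at an arbitrary offset.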
13
opal/mca/compress/bzip/configure.params
Normal file
@@ -0,0 +1,13 @@
|
|||||||
|
# -*- shell-script -*-
|
||||||
|
#
|
||||||
|
# Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
|
# All rights reserved.
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
|
||||||
|
PARAM_INIT_FILE=compress_bzip_component.c
|
||||||
|
PARAM_CONFIG_FILES="Makefile"
|
13
opal/mca/compress/bzip/help-opal-compress-bzip.txt
Normal file
@@ -0,0 +1,13 @@
|
|||||||
|
-*- text -*-
|
||||||
|
#
|
||||||
|
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
# University Research and Technology
|
||||||
|
# Corporation. All rights reserved.
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
# This is the US/English general help file for Open PAL Compress framework.
|
||||||
|
#
|
135
opal/mca/compress/compress.h
Normal file
@@ -0,0 +1,135 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
* University Research and Technology
|
||||||
|
* Corporation. All rights reserved.
|
||||||
|
*
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
/**
|
||||||
|
* @file
|
||||||
|
*
|
||||||
|
* Compression Framework
|
||||||
|
*
|
||||||
|
* General Description:
|
||||||
|
*
|
||||||
|
* The OPAL Compress framework has been created to provide an abstract interface
|
||||||
|
* to the compression agent library on the host machine. This framework is useful
|
||||||
|
* when distributing files that can be compressed before sending to diminish the
|
||||||
|
* load on the network.
|
||||||
|
*
|
||||||
|
*/
|
||||||
|
|
||||||
|
#ifndef MCA_COMPRESS_H
|
||||||
|
#define MCA_COMPRESS_H
|
||||||
|
|
||||||
|
#include "opal_config.h"
|
||||||
|
#include "opal/mca/mca.h"
|
||||||
|
#include "opal/mca/base/base.h"
|
||||||
|
#include "opal/class/opal_object.h"
|
||||||
|
|
||||||
|
#if defined(c_plusplus) || defined(__cplusplus)
|
||||||
|
extern "C" {
|
||||||
|
#endif
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Module initialization function.
|
||||||
|
* Returns OPAL_SUCCESS
|
||||||
|
*/
|
||||||
|
typedef int (*opal_compress_base_module_init_fn_t)
|
||||||
|
(void);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Module finalization function.
|
||||||
|
* Returns OPAL_SUCCESS
|
||||||
|
*/
|
||||||
|
typedef int (*opal_compress_base_module_finalize_fn_t)
|
||||||
|
(void);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Compress the file provided
|
||||||
|
*
|
||||||
|
* Arguments:
|
||||||
|
* fname = Filename to compress
|
||||||
|
* cname = Compressed filename
|
||||||
|
* postfix = postfix added to filename to create compressed filename
|
||||||
|
* Returns:
|
||||||
|
* OPAL_SUCCESS on success, otherwise OPAL_ERROR
|
||||||
|
*/
|
||||||
|
typedef int (*opal_compress_base_module_compress_fn_t)
|
||||||
|
(char * fname, char **cname, char **postfix);
|
||||||
|
|
||||||
|
typedef int (*opal_compress_base_module_compress_nb_fn_t)
|
||||||
|
(char * fname, char **cname, char **postfix, pid_t *child_pid);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Decompress the file provided
|
||||||
|
*
|
||||||
|
* Arguments:
|
||||||
|
* fname = Decompressed filename
|
||||||
|
* cname = Compressed filename
|
||||||
|
* Returns:
|
||||||
|
* OPAL_SUCCESS on success, otherwise OPAL_ERROR
|
||||||
|
*/
|
||||||
|
typedef int (*opal_compress_base_module_decompress_fn_t)
|
||||||
|
(char * cname, char **fname);
|
||||||
|
typedef int (*opal_compress_base_module_decompress_nb_fn_t)
|
||||||
|
(char * cname, char **fname, pid_t *child_pid);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Structure for COMPRESS components.
|
||||||
|
*/
|
||||||
|
struct opal_compress_base_component_2_0_0_t {
|
||||||
|
/** MCA base component */
|
||||||
|
mca_base_component_t base_version;
|
||||||
|
/** MCA base data */
|
||||||
|
mca_base_component_data_t base_data;
|
||||||
|
|
||||||
|
/** Verbosity Level */
|
||||||
|
int verbose;
|
||||||
|
/** Output Handle for opal_output */
|
||||||
|
int output_handle;
|
||||||
|
/** Default Priority */
|
||||||
|
int priority;
|
||||||
|
};
|
||||||
|
typedef struct opal_compress_base_component_2_0_0_t opal_compress_base_component_2_0_0_t;
|
||||||
|
typedef struct opal_compress_base_component_2_0_0_t opal_compress_base_component_t;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Structure for COMPRESS modules
|
||||||
|
*/
|
||||||
|
struct opal_compress_base_module_1_0_0_t {
|
||||||
|
/** Initialization Function */
|
||||||
|
opal_compress_base_module_init_fn_t init;
|
||||||
|
/** Finalization Function */
|
||||||
|
opal_compress_base_module_finalize_fn_t finalize;
|
||||||
|
|
||||||
|
/** Compress interface */
|
||||||
|
opal_compress_base_module_compress_fn_t compress;
|
||||||
|
opal_compress_base_module_compress_nb_fn_t compress_nb;
|
||||||
|
|
||||||
|
/** Decompress Interface */
|
||||||
|
opal_compress_base_module_decompress_fn_t decompress;
|
||||||
|
opal_compress_base_module_decompress_nb_fn_t decompress_nb;
|
||||||
|
};
|
||||||
|
typedef struct opal_compress_base_module_1_0_0_t opal_compress_base_module_1_0_0_t;
|
||||||
|
typedef struct opal_compress_base_module_1_0_0_t opal_compress_base_module_t;
|
||||||
|
|
||||||
|
OPAL_DECLSPEC extern opal_compress_base_module_t opal_compress;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Macro for use in components that are of type COMPRESS
|
||||||
|
*/
|
||||||
|
#define OPAL_COMPRESS_BASE_VERSION_2_0_0 \
|
||||||
|
MCA_BASE_VERSION_2_0_0, \
|
||||||
|
"compress", 2, 0, 0
|
||||||
|
|
||||||
|
#if defined(c_plusplus) || defined(__cplusplus)
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#endif /* MCA_COMPRESS_H */
|
||||||
|
|
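The header above exposes the selected component through a table of function pointers (`opal_compress`). The sketch below illustrates that calling convention only. It assumes nothing about the real components: every `demo_*` name is a hypothetical stand-in, the typedefs mirror just two of the signatures from compress.h, and the "component" merely computes file names without touching any file.

```c
#include <stdlib.h>
#include <string.h>

#define DEMO_SUCCESS 0
#define DEMO_ERROR   (-1)

/* Stand-ins mirroring the compress/decompress signatures in compress.h. */
typedef int (*demo_compress_fn_t)(char *fname, char **cname, char **postfix);
typedef int (*demo_decompress_fn_t)(char *cname, char **fname);

typedef struct {
    demo_compress_fn_t   compress;
    demo_decompress_fn_t decompress;
} demo_compress_module_t;

/* Return a malloc'ed string: the first n bytes of a, followed by b. */
static char *demo_join(const char *a, size_t n, const char *b)
{
    size_t blen = strlen(b);
    char *out = malloc(n + blen + 1);

    if (NULL != out) {
        memcpy(out, a, n);
        memcpy(out + n, b, blen + 1);   /* also copies the terminating NUL */
    }
    return out;
}

/* Name-only "compress": fills in the output names the way a caller expects. */
static int demo_compress_names(char *fname, char **cname, char **postfix)
{
    *postfix = demo_join("", 0, ".gz");
    *cname   = demo_join(fname, strlen(fname), ".gz");
    if (NULL == *postfix || NULL == *cname) {
        free(*postfix);
        free(*cname);
        *postfix = NULL;
        *cname   = NULL;
        return DEMO_ERROR;
    }
    return DEMO_SUCCESS;
}

/* Name-only "decompress": strips the postfix or reports an error. */
static int demo_decompress_names(char *cname, char **fname)
{
    size_t n = strlen(cname);

    if (n < 3 || 0 != strcmp(cname + n - 3, ".gz")) {
        return DEMO_ERROR;
    }
    *fname = demo_join(cname, n - 3, "");
    return (NULL != *fname) ? DEMO_SUCCESS : DEMO_ERROR;
}

/* Callers go through the table rather than naming a component directly. */
static demo_compress_module_t demo_compress = {
    demo_compress_names,
    demo_decompress_names
};
```

A caller would write `demo_compress.compress(fname, &cname, &postfix)` and later free `cname` and `postfix`; swapping in a different component changes only the table contents, not the call sites.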
40
opal/mca/compress/gzip/Makefile.am
Normal file
@@ -0,0 +1,40 @@
|
|||||||
|
#
|
||||||
|
# Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
|
# All rights reserved.
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
|
||||||
|
AM_CPPFLAGS = \
|
||||||
|
$(LTDLINCL)
|
||||||
|
|
||||||
|
dist_pkgdata_DATA = help-opal-compress-gzip.txt
|
||||||
|
|
||||||
|
sources = \
|
||||||
|
compress_gzip.h \
|
||||||
|
compress_gzip_component.c \
|
||||||
|
compress_gzip_module.c
|
||||||
|
|
||||||
|
# Make the output library in this directory, and name it either
|
||||||
|
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
|
||||||
|
# (for static builds).
|
||||||
|
|
||||||
|
if OMPI_BUILD_compress_gzip_DSO
|
||||||
|
component_noinst =
|
||||||
|
component_install = mca_compress_gzip.la
|
||||||
|
else
|
||||||
|
component_noinst = libmca_compress_gzip.la
|
||||||
|
component_install =
|
||||||
|
endif
|
||||||
|
|
||||||
|
mcacomponentdir = $(pkglibdir)
|
||||||
|
mcacomponent_LTLIBRARIES = $(component_install)
|
||||||
|
mca_compress_gzip_la_SOURCES = $(sources)
|
||||||
|
mca_compress_gzip_la_LDFLAGS = -module -avoid-version
|
||||||
|
|
||||||
|
noinst_LTLIBRARIES = $(component_noinst)
|
||||||
|
libmca_compress_gzip_la_SOURCES = $(sources)
|
||||||
|
libmca_compress_gzip_la_LDFLAGS = -module -avoid-version
|
63
opal/mca/compress/gzip/compress_gzip.h
Normal file
@@ -0,0 +1,63 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
|
* All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file
|
||||||
|
*
|
||||||
|
* GZIP COMPRESS component
|
||||||
|
*
|
||||||
|
* Uses the gzip command-line utility
|
||||||
|
*/
|
||||||
|
|
||||||
|
#ifndef MCA_COMPRESS_GZIP_EXPORT_H
|
||||||
|
#define MCA_COMPRESS_GZIP_EXPORT_H
|
||||||
|
|
||||||
|
#include "opal_config.h"
|
||||||
|
|
||||||
|
#include "opal/util/output.h"
|
||||||
|
|
||||||
|
#include "opal/mca/mca.h"
|
||||||
|
#include "opal/mca/compress/compress.h"
|
||||||
|
|
||||||
|
#if defined(c_plusplus) || defined(__cplusplus)
|
||||||
|
extern "C" {
|
||||||
|
#endif
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Local Component structures
|
||||||
|
*/
|
||||||
|
struct opal_compress_gzip_component_t {
|
||||||
|
opal_compress_base_component_t super; /** Base COMPRESS component */
|
||||||
|
|
||||||
|
};
|
||||||
|
typedef struct opal_compress_gzip_component_t opal_compress_gzip_component_t;
|
||||||
|
OPAL_MODULE_DECLSPEC extern opal_compress_gzip_component_t mca_compress_gzip_component;
|
||||||
|
|
||||||
|
int opal_compress_gzip_component_query(mca_base_module_t **module, int *priority);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Module functions
|
||||||
|
*/
|
||||||
|
int opal_compress_gzip_module_init(void);
|
||||||
|
int opal_compress_gzip_module_finalize(void);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Actual functionality
|
||||||
|
*/
|
||||||
|
int opal_compress_gzip_compress(char *fname, char **cname, char **postfix);
|
||||||
|
int opal_compress_gzip_compress_nb(char *fname, char **cname, char **postfix, pid_t *child_pid);
|
||||||
|
int opal_compress_gzip_decompress(char *cname, char **fname);
|
||||||
|
int opal_compress_gzip_decompress_nb(char *cname, char **fname, pid_t *child_pid);
|
||||||
|
|
||||||
|
#if defined(c_plusplus) || defined(__cplusplus)
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#endif /* MCA_COMPRESS_GZIP_EXPORT_H */
|
138
opal/mca/compress/gzip/compress_gzip_component.c
Normal file
@@ -0,0 +1,138 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
|
* All rights reserved.
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include "opal_config.h"
|
||||||
|
|
||||||
|
#include "opal/constants.h"
|
||||||
|
#include "opal/mca/compress/compress.h"
|
||||||
|
#include "opal/mca/compress/base/base.h"
|
||||||
|
#include "compress_gzip.h"
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Public string for version number
|
||||||
|
*/
|
||||||
|
const char *opal_compress_gzip_component_version_string =
|
||||||
|
"OPAL COMPRESS gzip MCA component version " OPAL_VERSION;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Local functionality
|
||||||
|
*/
|
||||||
|
static int compress_gzip_open(void);
|
||||||
|
static int compress_gzip_close(void);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Instantiate the public struct with all of our public information
|
||||||
|
* and pointer to our public functions in it
|
||||||
|
*/
|
||||||
|
opal_compress_gzip_component_t mca_compress_gzip_component = {
|
||||||
|
/* First do the base component stuff */
|
||||||
|
{
|
||||||
|
/* Handle the general mca_component_t struct containing
|
||||||
|
* meta information about the component itself
|
||||||
|
*/
|
||||||
|
{
|
||||||
|
OPAL_COMPRESS_BASE_VERSION_2_0_0,
|
||||||
|
|
||||||
|
/* Component name and version */
|
||||||
|
"gzip",
|
||||||
|
OPAL_MAJOR_VERSION,
|
||||||
|
OPAL_MINOR_VERSION,
|
||||||
|
OPAL_RELEASE_VERSION,
|
||||||
|
|
||||||
|
/* Component open and close functions */
|
||||||
|
compress_gzip_open,
|
||||||
|
compress_gzip_close,
|
||||||
|
opal_compress_gzip_component_query
|
||||||
|
},
|
||||||
|
{
|
||||||
|
/* The component is checkpoint ready */
|
||||||
|
MCA_BASE_METADATA_PARAM_CHECKPOINT
|
||||||
|
},
|
||||||
|
|
||||||
|
/* Verbosity level */
|
||||||
|
0,
|
||||||
|
/* opal_output handler */
|
||||||
|
-1,
|
||||||
|
/* Default priority */
|
||||||
|
15
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Gzip module
|
||||||
|
*/
|
||||||
|
static opal_compress_base_module_t loc_module = {
|
||||||
|
/** Initialization Function */
|
||||||
|
opal_compress_gzip_module_init,
|
||||||
|
/** Finalization Function */
|
||||||
|
opal_compress_gzip_module_finalize,
|
||||||
|
|
||||||
|
/** Compress Function */
|
||||||
|
opal_compress_gzip_compress,
|
||||||
|
opal_compress_gzip_compress_nb,
|
||||||
|
|
||||||
|
/** Decompress Function */
|
||||||
|
opal_compress_gzip_decompress,
|
||||||
|
opal_compress_gzip_decompress_nb
|
||||||
|
};
|
||||||
|
|
||||||
|
static int compress_gzip_open(void)
|
||||||
|
{
|
||||||
|
mca_base_param_reg_int(&mca_compress_gzip_component.super.base_version,
|
||||||
|
"priority",
|
||||||
|
"Priority of the COMPRESS gzip component",
|
||||||
|
false, false,
|
||||||
|
mca_compress_gzip_component.super.priority,
|
||||||
|
&mca_compress_gzip_component.super.priority);
|
||||||
|
|
||||||
|
mca_base_param_reg_int(&mca_compress_gzip_component.super.base_version,
|
||||||
|
"verbose",
|
||||||
|
"Verbose level for the COMPRESS gzip component",
|
||||||
|
false, false,
|
||||||
|
mca_compress_gzip_component.super.verbose,
|
||||||
|
&mca_compress_gzip_component.super.verbose);
|
||||||
|
/* If there is a custom verbose level for this component then use it
|
||||||
|
* otherwise take our parents level and output channel
|
||||||
|
*/
|
||||||
|
if ( 0 != mca_compress_gzip_component.super.verbose) {
|
||||||
|
mca_compress_gzip_component.super.output_handle = opal_output_open(NULL);
|
||||||
|
opal_output_set_verbosity(mca_compress_gzip_component.super.output_handle,
|
||||||
|
mca_compress_gzip_component.super.verbose);
|
||||||
|
} else {
|
||||||
|
mca_compress_gzip_component.super.output_handle = opal_compress_base_output;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Debug output
|
||||||
|
*/
|
||||||
|
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||||
|
"compress:gzip: open()");
|
||||||
|
opal_output_verbose(20, mca_compress_gzip_component.super.output_handle,
|
||||||
|
"compress:gzip: open: priority = %d",
|
||||||
|
mca_compress_gzip_component.super.priority);
|
||||||
|
opal_output_verbose(20, mca_compress_gzip_component.super.output_handle,
|
||||||
|
"compress:gzip: open: verbosity = %d",
|
||||||
|
mca_compress_gzip_component.super.verbose);
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
static int compress_gzip_close(void)
|
||||||
|
{
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_compress_gzip_component_query(mca_base_module_t **module, int *priority)
|
||||||
|
{
|
||||||
|
*module = (mca_base_module_t *)&loc_module;
|
||||||
|
*priority = mca_compress_gzip_component.super.priority;
|
||||||
|
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
250
opal/mca/compress/gzip/compress_gzip_module.c
Normal file
@@ -0,0 +1,250 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
|
* All rights reserved.
|
||||||
|
*
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include "opal_config.h"
|
||||||
|
|
||||||
|
#include <string.h>
|
||||||
|
#include <sys/types.h>
|
||||||
|
#include <sys/wait.h>
|
||||||
|
#include <sys/stat.h>
|
||||||
|
#if HAVE_UNISTD_H
|
||||||
|
#include <unistd.h>
|
||||||
|
#endif /* HAVE_UNISTD_H */
|
||||||
|
|
||||||
|
#include "opal/util/opal_environ.h"
|
||||||
|
#include "opal/util/output.h"
|
||||||
|
#include "opal/util/show_help.h"
|
||||||
|
#include "opal/util/argv.h"
|
||||||
|
#include "opal/util/opal_environ.h"
|
||||||
|
|
||||||
|
#include "opal/constants.h"
|
||||||
|
#include "opal/mca/base/mca_base_param.h"
|
||||||
|
#include "opal/util/basename.h"
|
||||||
|
|
||||||
|
#include "opal/mca/compress/compress.h"
|
||||||
|
#include "opal/mca/compress/base/base.h"
|
||||||
|
#include "opal/runtime/opal_cr.h"
|
||||||
|
|
||||||
|
#include "compress_gzip.h"
|
||||||
|
|
||||||
|
static bool is_directory(char *fname );
|
||||||
|
|
||||||
|
int opal_compress_gzip_module_init(void)
|
||||||
|
{
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_compress_gzip_module_finalize(void)
|
||||||
|
{
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_compress_gzip_compress(char * fname, char **cname, char **postfix)
|
||||||
|
{
|
||||||
|
pid_t child_pid = 0;
|
||||||
|
int status = 0;
|
||||||
|
|
||||||
|
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||||
|
"compress:gzip: compress(%s)",
|
||||||
|
fname);
|
||||||
|
|
||||||
|
opal_compress_gzip_compress_nb(fname, cname, postfix, &child_pid);
|
||||||
|
waitpid(child_pid, &status, 0);
|
||||||
|
|
||||||
|
if( WIFEXITED(status) ) {
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
} else {
|
||||||
|
return OPAL_ERROR;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_compress_gzip_compress_nb(char * fname, char **cname, char **postfix, pid_t *child_pid)
|
||||||
|
{
|
||||||
|
char * cmd = NULL;
|
||||||
|
char **argv = NULL;
|
||||||
|
char * base_fname = NULL;
|
||||||
|
char * dir_fname = NULL;
|
||||||
|
int status;
|
||||||
|
bool is_dir;
|
||||||
|
|
||||||
|
is_dir = is_directory(fname);
|
||||||
|
|
||||||
|
*child_pid = fork();
|
||||||
|
if( *child_pid == 0 ) { /* Child */
|
||||||
|
|
||||||
|
dir_fname = opal_dirname(fname);
|
||||||
|
base_fname = opal_basename(fname);
|
||||||
|
|
||||||
|
chdir(dir_fname);
|
||||||
|
|
||||||
|
if( is_dir ) {
|
||||||
|
#if 0
|
||||||
|
opal_compress_base_tar_create(&base_fname);
|
||||||
|
asprintf(cname, "%s.gz", base_fname);
|
||||||
|
asprintf(&cmd, "gzip %s", base_fname);
|
||||||
|
#else
|
||||||
|
asprintf(cname, "%s.tar.gz", base_fname);
|
||||||
|
asprintf(&cmd, "tar -zcf %s %s", *cname, base_fname);
|
||||||
|
#endif
|
||||||
|
} else {
|
||||||
|
asprintf(cname, "%s.gz", base_fname);
|
||||||
|
asprintf(&cmd, "gzip %s", base_fname);
|
||||||
|
}
|
||||||
|
|
||||||
|
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||||
|
"compress:gzip: compress_nb(%s -> [%s])",
|
||||||
|
fname, *cname);
|
||||||
|
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||||
|
"compress:gzip: compress_nb() command [%s]",
|
||||||
|
cmd);
|
||||||
|
|
||||||
|
argv = opal_argv_split(cmd, ' ');
|
||||||
|
status = execvp(argv[0], argv);
|
||||||
|
|
||||||
|
opal_output(0, "compress:gzip: compress_nb: Failed to exec child [%s] status = %d\n", cmd, status);
|
||||||
|
exit(OPAL_ERROR);
|
||||||
|
}
|
||||||
|
else if( *child_pid > 0 ) {
|
||||||
|
if( is_dir ) {
|
||||||
|
*postfix = strdup(".tar.gz");
|
||||||
|
} else {
|
||||||
|
*postfix = strdup(".gz");
|
||||||
|
}
|
||||||
|
asprintf(cname, "%s%s", fname, *postfix);
|
||||||
|
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
return OPAL_ERROR;
|
||||||
|
}
|
||||||
|
|
||||||
|
if( NULL != cmd ) {
|
||||||
|
free(cmd);
|
||||||
|
cmd = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_compress_gzip_decompress(char * cname, char **fname)
|
||||||
|
{
|
||||||
|
pid_t child_pid = 0;
|
||||||
|
int status = 0;
|
||||||
|
|
||||||
|
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||||
|
"compress:gzip: decompress(%s)",
|
||||||
|
cname);
|
||||||
|
|
||||||
|
opal_compress_gzip_decompress_nb(cname, fname, &child_pid);
|
||||||
|
waitpid(child_pid, &status, 0);
|
||||||
|
|
||||||
|
if( WIFEXITED(status) ) {
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
} else {
|
||||||
|
return OPAL_ERROR;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_compress_gzip_decompress_nb(char * cname, char **fname, pid_t *child_pid)
|
||||||
|
{
|
||||||
|
char * cmd = NULL;
|
||||||
|
char **argv = NULL;
|
||||||
|
char * dir_cname = NULL;
|
||||||
|
pid_t loc_pid = 0;
|
||||||
|
int status;
|
||||||
|
bool is_tar = false;
|
||||||
|
|
||||||
|
if( strlen(cname) > 7 && 0 == strncmp(&(cname[strlen(cname)-7]), ".tar.gz", strlen(".tar.gz")) ) {
|
||||||
|
is_tar = true;
|
||||||
|
}
|
||||||
|
|
||||||
|
*fname = strdup(cname);
|
||||||
|
if( is_tar ) {
|
||||||
|
/* Strip off '.tar.gz' */
|
||||||
|
(*fname)[strlen(cname)-7] = '\0';
|
||||||
|
} else {
|
||||||
|
/* Strip off '.gz' */
|
||||||
|
(*fname)[strlen(cname)-3] = '\0';
|
||||||
|
}
|
||||||
|
|
||||||
|
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||||
|
"compress:gzip: decompress_nb(%s -> [%s])",
|
||||||
|
cname, *fname);
|
||||||
|
|
||||||
|
*child_pid = fork();
|
||||||
|
if( *child_pid == 0 ) { /* Child */
|
||||||
|
dir_cname = opal_dirname(cname);
|
||||||
|
|
||||||
|
chdir(dir_cname);
|
||||||
|
|
||||||
|
/* Fork(gunzip) */
|
||||||
|
loc_pid = fork();
|
||||||
|
if( loc_pid == 0 ) { /* Child */
|
||||||
|
asprintf(&cmd, "gunzip %s", cname);
|
||||||
|
|
||||||
|
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||||
|
"compress:gzip: decompress_nb() command [%s]",
|
||||||
|
cmd);
|
||||||
|
|
||||||
|
argv = opal_argv_split(cmd, ' ');
|
||||||
|
status = execvp(argv[0], argv);
|
||||||
|
|
||||||
|
opal_output(0, "compress:gzip: decompress_nb: Failed to exec child [%s] status = %d\n", cmd, status);
|
||||||
|
exit(OPAL_ERROR);
|
||||||
|
}
|
||||||
|
else if( loc_pid > 0 ) { /* Parent */
|
||||||
|
waitpid(loc_pid, &status, 0);
|
||||||
|
if( !WIFEXITED(status) ) {
|
||||||
|
opal_output(0, "compress:gzip: decompress_nb: Failed to gunzip the file [%s] status = %d\n", cname, status);
|
||||||
|
exit(OPAL_ERROR);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
exit(OPAL_ERROR);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* tar_decompress */
|
||||||
|
if( is_tar ) {
|
||||||
|
/* Strip off '.gz' leaving just '.tar' */
|
||||||
|
cname[strlen(cname)-3] = '\0';
|
||||||
|
opal_compress_base_tar_extract(&cname);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Once this child is done, then directly exit */
|
||||||
|
exit(OPAL_SUCCESS);
|
||||||
|
}
|
||||||
|
else if( *child_pid > 0 ) {
|
||||||
|
;
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
return OPAL_ERROR;
|
||||||
|
}
|
||||||
|
|
||||||
|
if( NULL != cmd ) {
|
||||||
|
free(cmd);
|
||||||
|
cmd = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
static bool is_directory(char *fname ) {
|
||||||
|
struct stat file_status;
|
||||||
|
int rc;
|
||||||
|
|
||||||
|
if(0 != (rc = stat(fname, &file_status) ) ) {
|
||||||
|
return false;
|
||||||
|
}
|
||||||
|
if(S_ISDIR(file_status.st_mode)) {
|
||||||
|
return true;
|
||||||
|
}
|
||||||
|
|
||||||
|
return false;
|
||||||
|
}
|
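The blocking wrappers above treat `WIFEXITED(status)` as success, which is also true when the child exits with a non-zero code, and the command line is split on spaces, which breaks for paths containing a space. The following is a minimal sketch of an alternative under those observations, not the module's implementation: `run_and_wait` is a hypothetical helper that takes an argument vector directly and succeeds only when the child exits normally with status 0.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): run argv[0] with the given
 * argument vector and wait for it. Returns 0 only if the child exited
 * normally with exit code 0, and -1 in every other case. */
static int run_and_wait(char *const argv[])
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;                      /* fork failed */
    }
    if (0 == pid) {                     /* child */
        execvp(argv[0], argv);
        _exit(127);                     /* reached only if exec failed */
    }

    /* parent: retry if the wait is interrupted by a signal */
    while (waitpid(pid, &status, 0) < 0) {
        if (EINTR != errno) {
            return -1;
        }
    }

    /* A normal exit with a non-zero code still counts as a failure. */
    return (WIFEXITED(status) && 0 == WEXITSTATUS(status)) ? 0 : -1;
}
```

With this shape a caller could build `char *argv[] = { "gunzip", cname, NULL };` and pass the file name as a single argument, so no splitting of a formatted command string is needed.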
13
opal/mca/compress/gzip/configure.params
Normal file
@@ -0,0 +1,13 @@
|
|||||||
|
# -*- shell-script -*-
|
||||||
|
#
|
||||||
|
# Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
|
# All rights reserved.
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
|
||||||
|
PARAM_INIT_FILE=compress_gzip_component.c
|
||||||
|
PARAM_CONFIG_FILES="Makefile"
|
13
opal/mca/compress/gzip/help-opal-compress-gzip.txt
Normal file
@@ -0,0 +1,13 @@
|
|||||||
|
-*- text -*-
|
||||||
|
#
|
||||||
|
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
|
# University Research and Technology
|
||||||
|
# Corporation. All rights reserved.
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
# This is the US/English general help file for Open PAL Compress framework.
|
||||||
|
#
|
@@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
* University Research and Technology
|
* University Research and Technology
|
||||||
* Corporation. All rights reserved.
|
* Corporation. All rights reserved.
|
||||||
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||||
@@ -23,6 +23,7 @@
|
|||||||
#include "opal_config.h"
|
#include "opal_config.h"
|
||||||
#include "opal/mca/crs/crs.h"
|
#include "opal/mca/crs/crs.h"
|
||||||
#include "opal/util/opal_environ.h"
|
#include "opal/util/opal_environ.h"
|
||||||
|
#include "opal/runtime/opal_cr.h"
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Global functions for MCA overall CRS
|
* Global functions for MCA overall CRS
|
||||||
@@ -32,7 +33,7 @@ BEGIN_C_DECLS
|
|||||||
|
|
||||||
/* Some local strings to use generically with the local metadata file */
|
/* Some local strings to use generically with the local metadata file */
|
||||||
#define CRS_METADATA_BASE ("# ")
|
#define CRS_METADATA_BASE ("# ")
|
||||||
#define CRS_METADATA_COMP ("# Component: ")
|
#define CRS_METADATA_COMP ("# OPAL CRS Component: ")
|
||||||
#define CRS_METADATA_PID ("# PID: ")
|
#define CRS_METADATA_PID ("# PID: ")
|
||||||
#define CRS_METADATA_CONTEXT ("# CONTEXT: ")
|
#define CRS_METADATA_CONTEXT ("# CONTEXT: ")
|
||||||
#define CRS_METADATA_MKDIR ("# MKDIR: ")
|
#define CRS_METADATA_MKDIR ("# MKDIR: ")
|
||||||
@@ -71,35 +72,25 @@ BEGIN_C_DECLS
|
|||||||
/**
|
/**
|
||||||
* Globals
|
* Globals
|
||||||
*/
|
*/
|
||||||
#define opal_crs_base_metadata_filename (strdup("snapshot_meta.data"))
|
|
||||||
|
|
||||||
OPAL_DECLSPEC extern int opal_crs_base_output;
|
OPAL_DECLSPEC extern int opal_crs_base_output;
|
||||||
OPAL_DECLSPEC extern opal_list_t opal_crs_base_components_available;
|
OPAL_DECLSPEC extern opal_list_t opal_crs_base_components_available;
|
||||||
OPAL_DECLSPEC extern opal_crs_base_component_t opal_crs_base_selected_component;
|
OPAL_DECLSPEC extern opal_crs_base_component_t opal_crs_base_selected_component;
|
||||||
OPAL_DECLSPEC extern opal_crs_base_module_t opal_crs;
|
OPAL_DECLSPEC extern opal_crs_base_module_t opal_crs;
|
||||||
OPAL_DECLSPEC extern char * opal_crs_base_snapshot_dir;
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Some utility functions
|
* Some utility functions
|
||||||
*/
|
*/
|
||||||
OPAL_DECLSPEC char * opal_crs_base_state_str(opal_crs_state_type_t state);
|
OPAL_DECLSPEC char * opal_crs_base_state_str(opal_crs_state_type_t state);
|
||||||
|
|
||||||
OPAL_DECLSPEC char * opal_crs_base_unique_snapshot_name(pid_t pid);
|
/*
|
||||||
OPAL_DECLSPEC int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** component_name, int *prev_pid);
|
* Extract the expected component and pid from the metadata
|
||||||
OPAL_DECLSPEC int opal_crs_base_init_snapshot_directory(opal_crs_base_snapshot_t *snapshot);
|
*/
|
||||||
OPAL_DECLSPEC char * opal_crs_base_get_snapshot_directory(char *uniq_snapshot_name);
|
OPAL_DECLSPEC int opal_crs_base_extract_expected_component(FILE *metadata, char ** component_name, int *prev_pid);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Read a token from the metadata file
|
* Read a token from the metadata file
|
||||||
* NULL can be passed for snapshot_loc if nit_snapshot_directory has been called.
|
|
||||||
*/
|
*/
|
||||||
OPAL_DECLSPEC int opal_crs_base_metadata_read_token(char *snapshot_loc, char * token, char ***value);
|
OPAL_DECLSPEC int opal_crs_base_metadata_read_token(FILE *metadata, char * token, char ***value);
|
||||||
|
|
||||||
/*
|
|
||||||
* Write a token to the metadata file
|
|
||||||
* NULL can be passed for snapshot_loc if nit_snapshot_directory has been called.
|
|
||||||
*/
|
|
||||||
OPAL_DECLSPEC int opal_crs_base_metadata_write_token(char *snapshot_loc, char * token, char *value);
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Register a file for cleanup.
|
* Register a file for cleanup.
|
||||||
@@ -122,6 +113,24 @@ BEGIN_C_DECLS
|
|||||||
*/
|
*/
|
||||||
OPAL_DECLSPEC int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target);
|
OPAL_DECLSPEC int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* CRS self application interface functions
|
||||||
|
*/
|
||||||
|
typedef int (*opal_crs_base_self_checkpoint_fn_t)(char **restart_cmd);
|
||||||
|
typedef int (*opal_crs_base_self_restart_fn_t)(void);
|
||||||
|
typedef int (*opal_crs_base_self_continue_fn_t)(void);
|
||||||
|
|
||||||
|
extern opal_crs_base_self_checkpoint_fn_t crs_base_self_checkpoint_fn;
|
||||||
|
extern opal_crs_base_self_restart_fn_t crs_base_self_restart_fn;
|
||||||
|
extern opal_crs_base_self_continue_fn_t crs_base_self_continue_fn;
|
||||||
|
|
||||||
|
OPAL_DECLSPEC int opal_crs_base_self_register_checkpoint_callback
|
||||||
|
(opal_crs_base_self_checkpoint_fn_t function);
|
||||||
|
OPAL_DECLSPEC int opal_crs_base_self_register_restart_callback
|
||||||
|
(opal_crs_base_self_restart_fn_t function);
|
||||||
|
OPAL_DECLSPEC int opal_crs_base_self_register_continue_callback
|
||||||
|
(opal_crs_base_self_continue_fn_t function);
|
||||||
|
|
||||||
END_C_DECLS
|
END_C_DECLS
|
||||||
|
|
||||||
#endif /* OPAL_CRS_BASE_H */
|
#endif /* OPAL_CRS_BASE_H */
|
||||||
|
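The hunk above adds three callback typedefs and matching registration functions for the CRS self interface. The sketch below shows, as an assumption, how an application might use such an interface: the typedefs copy the signatures from the diff, but every `demo_*` and `app_*` name is hypothetical, and the order in which callbacks fire (checkpoint, then continue; restart only after a restart) is an illustration rather than a statement about the actual CRS implementation.

```c
#include <stddef.h>

/* Signatures copied from the diff; the demo_ prefix marks stand-ins. */
typedef int (*demo_self_checkpoint_fn_t)(char **restart_cmd);
typedef int (*demo_self_restart_fn_t)(void);
typedef int (*demo_self_continue_fn_t)(void);

static demo_self_checkpoint_fn_t demo_checkpoint_fn = NULL;
static demo_self_restart_fn_t    demo_restart_fn    = NULL;
static demo_self_continue_fn_t   demo_continue_fn   = NULL;

/* Hypothetical registration functions: remember the callback, return 0. */
static int demo_register_checkpoint_callback(demo_self_checkpoint_fn_t fn)
{
    demo_checkpoint_fn = fn;
    return 0;
}

static int demo_register_restart_callback(demo_self_restart_fn_t fn)
{
    demo_restart_fn = fn;
    return 0;
}

static int demo_register_continue_callback(demo_self_continue_fn_t fn)
{
    demo_continue_fn = fn;
    return 0;
}

/* Application side: count how often the callbacks run. */
static int app_calls = 0;

static int app_checkpoint(char **restart_cmd)
{
    *restart_cmd = NULL;        /* no custom restart command in this sketch */
    app_calls++;
    return 0;
}

static int app_restart(void)
{
    app_calls++;
    return 0;
}

static int app_continue(void)
{
    app_calls++;
    return 0;
}

/* Assumed driver: invoke whichever callbacks were registered. */
static int demo_take_checkpoint(void)
{
    char *restart_cmd = NULL;

    if (NULL != demo_checkpoint_fn && 0 != demo_checkpoint_fn(&restart_cmd)) {
        return -1;
    }
    /* ... the snapshot itself would be written here ... */
    if (NULL != demo_continue_fn && 0 != demo_continue_fn()) {
        return -1;
    }
    return 0;
}

static int demo_restart(void)
{
    return (NULL != demo_restart_fn) ? demo_restart_fn() : 0;
}
```

Keeping the callbacks optional (a NULL pointer means "nothing registered") lets an application register only the hooks it needs.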
@@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2007 The Trustees of Indiana University.
|
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
* All rights reserved.
|
* All rights reserved.
|
||||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||||
* All rights reserved.
|
* All rights reserved.
|
||||||
@@ -24,6 +24,12 @@
|
|||||||
|
|
||||||
int opal_crs_base_close(void)
|
int opal_crs_base_close(void)
|
||||||
{
|
{
|
||||||
|
if( !opal_cr_is_enabled ) {
|
||||||
|
opal_output_verbose(10, opal_crs_base_output,
|
||||||
|
"crs:close: FT is not enabled, skipping!");
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
/* Call the component's finalize routine */
|
/* Call the component's finalize routine */
|
||||||
if( NULL != opal_crs.crs_finalize ) {
|
if( NULL != opal_crs.crs_finalize ) {
|
||||||
opal_crs.crs_finalize();
|
opal_crs.crs_finalize();
|
||||||
|
@@ -44,13 +44,15 @@
|
|||||||
#include "opal/mca/crs/crs.h"
|
#include "opal/mca/crs/crs.h"
|
||||||
#include "opal/mca/crs/base/base.h"
|
#include "opal/mca/crs/base/base.h"
|
||||||
|
|
||||||
|
opal_crs_base_self_checkpoint_fn_t crs_base_self_checkpoint_fn;
|
||||||
|
opal_crs_base_self_restart_fn_t crs_base_self_restart_fn;
|
||||||
|
opal_crs_base_self_continue_fn_t crs_base_self_continue_fn;
|
||||||
|
|
||||||
/******************
|
/******************
|
||||||
* Local Functions
|
* Local Functions
|
||||||
******************/
|
******************/
|
||||||
static int metadata_extract_next_token(FILE *file, char **token, char **value);
|
static int metadata_extract_next_token(FILE *file, char **token, char **value);
|
||||||
static int opal_crs_base_metadata_open(FILE ** meta_data, char * location, char * mode);
|
|
||||||
|
|
||||||
static char *last_metadata_file = NULL;
|
|
||||||
static char **cleanup_file_argv = NULL;
|
static char **cleanup_file_argv = NULL;
|
||||||
static char **cleanup_dir_argv = NULL;
|
static char **cleanup_dir_argv = NULL;
|
||||||
|
|
||||||
@ -60,29 +62,29 @@ static char **cleanup_dir_argv = NULL;
|
|||||||
static void opal_crs_base_construct(opal_crs_base_snapshot_t *snapshot)
|
static void opal_crs_base_construct(opal_crs_base_snapshot_t *snapshot)
|
||||||
{
|
{
|
||||||
snapshot->component_name = NULL;
|
snapshot->component_name = NULL;
|
||||||
snapshot->reference_name = opal_crs_base_unique_snapshot_name(getpid());
|
|
||||||
snapshot->local_location = opal_crs_base_get_snapshot_directory(snapshot->reference_name);
|
snapshot->metadata_filename = NULL;
|
||||||
snapshot->remote_location = strdup(snapshot->local_location);
|
snapshot->metadata = NULL;
|
||||||
|
snapshot->snapshot_directory = NULL;
|
||||||
|
|
||||||
snapshot->cold_start = false;
|
snapshot->cold_start = false;
|
||||||
}
|
}
|
||||||
|
|
||||||
static void opal_crs_base_destruct( opal_crs_base_snapshot_t *snapshot)
|
static void opal_crs_base_destruct( opal_crs_base_snapshot_t *snapshot)
|
||||||
{
|
{
|
||||||
if(NULL != snapshot->reference_name) {
|
if(NULL != snapshot->metadata_filename ) {
|
||||||
free(snapshot->reference_name);
|
free(snapshot->metadata_filename);
|
||||||
snapshot->reference_name = NULL;
|
snapshot->metadata_filename = NULL;
|
||||||
}
|
}
|
||||||
if(NULL != snapshot->local_location) {
|
|
||||||
free(snapshot->local_location);
|
if(NULL != snapshot->metadata) {
|
||||||
snapshot->local_location = NULL;
|
fclose(snapshot->metadata);
|
||||||
|
snapshot->metadata = NULL;
|
||||||
}
|
}
|
||||||
if(NULL != snapshot->remote_location) {
|
|
||||||
free(snapshot->remote_location);
|
if(NULL != snapshot->snapshot_directory ) {
|
||||||
snapshot->remote_location = NULL;
|
free(snapshot->snapshot_directory);
|
||||||
}
|
snapshot->snapshot_directory = NULL;
|
||||||
if(NULL != snapshot->component_name) {
|
|
||||||
free(snapshot->component_name);
|
|
||||||
snapshot->component_name = NULL;
|
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -107,43 +109,29 @@ OBJ_CLASS_INSTANCE(opal_crs_base_ckpt_options_t,
|
|||||||
/*
|
/*
|
||||||
* Utility functions
|
* Utility functions
|
||||||
*/
|
*/
|
||||||
char * opal_crs_base_unique_snapshot_name(pid_t pid)
|
int opal_crs_base_metadata_read_token(FILE *metadata, char * token, char ***value) {
|
||||||
{
|
int exit_status = OPAL_SUCCESS;
|
||||||
char * loc_str = NULL;
|
|
||||||
|
|
||||||
asprintf(&loc_str, "opal_snapshot_%d.ckpt", pid);
|
|
||||||
|
|
||||||
return loc_str;
|
|
||||||
}
|
|
||||||
|
|
||||||
int opal_crs_base_metadata_read_token(char *snapshot_loc, char * token, char ***value) {
|
|
||||||
int ret, exit_status = OPAL_SUCCESS;
|
|
||||||
FILE * meta_data = NULL;
|
|
||||||
char * loc_token = NULL;
|
char * loc_token = NULL;
|
||||||
char * loc_value = NULL;
|
char * loc_value = NULL;
|
||||||
int argc = 0;
|
int argc = 0;
|
||||||
|
|
||||||
/* Dummy check */
|
/* Dummy check */
|
||||||
if( NULL == token ) {
|
if( NULL == token ) {
|
||||||
|
exit_status = OPAL_ERROR;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
if( NULL == metadata ) {
|
||||||
/*
|
exit_status = OPAL_ERROR;
|
||||||
* Open the metadata file
|
|
||||||
*/
|
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_open(&meta_data, snapshot_loc, "r")) ) {
|
|
||||||
opal_output(opal_crs_base_output,
|
|
||||||
"opal:crs:base: opal_crs_base_metadata_read_token: Error: Unable to open the metadata file\n");
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Extract each token and make the records
|
* Extract each token and make the records
|
||||||
*/
|
*/
|
||||||
|
rewind(metadata);
|
||||||
do {
|
do {
|
||||||
/* Get next token */
|
/* Get next token */
|
||||||
if( OPAL_SUCCESS != metadata_extract_next_token(meta_data, &loc_token, &loc_value) ) {
|
if( OPAL_SUCCESS != metadata_extract_next_token(metadata, &loc_token, &loc_value) ) {
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -151,54 +139,26 @@ int opal_crs_base_metadata_read_token(char *snapshot_loc, char * token, char ***
|
|||||||
if(0 == strncmp(token, loc_token, strlen(loc_token)) ) {
|
if(0 == strncmp(token, loc_token, strlen(loc_token)) ) {
|
||||||
opal_argv_append(&argc, value, loc_value);
|
opal_argv_append(&argc, value, loc_value);
|
||||||
}
|
}
|
||||||
} while(0 == feof(meta_data) );
|
} while(0 == feof(metadata) );
|
||||||
|
|
||||||
cleanup:
|
cleanup:
|
||||||
if(NULL != meta_data) {
|
rewind(metadata);
|
||||||
fclose(meta_data);
|
|
||||||
meta_data = NULL;
|
|
||||||
}
|
|
||||||
|
|
||||||
return exit_status;
|
return exit_status;
|
||||||
}
|
}
|
||||||
|
|
||||||
int opal_crs_base_metadata_write_token(char *snapshot_loc, char * token, char *value) {
|
int opal_crs_base_extract_expected_component(FILE *metadata, char ** component_name, int *prev_pid)
|
||||||
int ret, exit_status = OPAL_SUCCESS;
|
|
||||||
FILE * meta_data = NULL;
|
|
||||||
|
|
||||||
/* Dummy check */
|
|
||||||
if( NULL == token || NULL == value) {
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Open the metadata file
|
|
||||||
*/
|
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_open(&meta_data, snapshot_loc, "a")) ) {
|
|
||||||
opal_output(opal_crs_base_output,
|
|
||||||
"opal:crs:base: opal_crs_base_metadata_write_token: Error: Unable to open the metadata file\n");
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
fprintf(meta_data, "%s%s\n", token, value);
|
|
||||||
|
|
||||||
cleanup:
|
|
||||||
if(NULL != meta_data) {
|
|
||||||
fclose(meta_data);
|
|
||||||
meta_data = NULL;
|
|
||||||
}
|
|
||||||
|
|
||||||
return exit_status;
|
|
||||||
}
|
|
||||||
|
|
||||||
int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** component_name, int *prev_pid)
|
|
||||||
{
|
{
|
||||||
int exit_status = OPAL_SUCCESS;
|
int exit_status = OPAL_SUCCESS;
|
||||||
char **pid_argv = NULL;
|
char **pid_argv = NULL;
|
||||||
char **name_argv = NULL;
|
char **name_argv = NULL;
|
||||||
|
|
||||||
opal_crs_base_metadata_read_token(snapshot_loc, CRS_METADATA_PID, &pid_argv);
|
/* Dummy check */
|
||||||
|
if( NULL == metadata ) {
|
||||||
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
opal_crs_base_metadata_read_token(metadata, CRS_METADATA_PID, &pid_argv);
|
||||||
if( NULL != pid_argv && NULL != pid_argv[0] ) {
|
if( NULL != pid_argv && NULL != pid_argv[0] ) {
|
||||||
*prev_pid = atoi(pid_argv[0]);
|
*prev_pid = atoi(pid_argv[0]);
|
||||||
} else {
|
} else {
|
||||||
@ -207,7 +167,7 @@ int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** compone
|
|||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
opal_crs_base_metadata_read_token(snapshot_loc, CRS_METADATA_COMP, &name_argv);
|
opal_crs_base_metadata_read_token(metadata, CRS_METADATA_COMP, &name_argv);
|
||||||
if( NULL != name_argv && NULL != name_argv[0] ) {
|
if( NULL != name_argv && NULL != name_argv[0] ) {
|
||||||
*component_name = strdup(name_argv[0]);
|
*component_name = strdup(name_argv[0]);
|
||||||
} else {
|
} else {
|
||||||
@ -230,68 +190,6 @@ int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** compone
|
|||||||
return exit_status;
|
return exit_status;
|
||||||
}
|
}
|
||||||
|
|
||||||
char * opal_crs_base_get_snapshot_directory(char *uniq_snapshot_name)
|
|
||||||
{
|
|
||||||
char * dir_name = NULL;
|
|
||||||
|
|
||||||
asprintf(&dir_name, "%s/%s", opal_crs_base_snapshot_dir, uniq_snapshot_name);
|
|
||||||
|
|
||||||
return dir_name;
|
|
||||||
}
|
|
||||||
|
|
||||||
int opal_crs_base_init_snapshot_directory(opal_crs_base_snapshot_t *snapshot)
|
|
||||||
{
|
|
||||||
int ret, exit_status = OPAL_SUCCESS;
|
|
||||||
mode_t my_mode = S_IRWXU;
|
|
||||||
char * pid_str = NULL;
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Make the snapshot directory from the uniq_snapshot_name
|
|
||||||
*/
|
|
||||||
if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(snapshot->local_location, my_mode)) ) {
|
|
||||||
opal_output(opal_crs_base_output,
|
|
||||||
"opal:crs:base: init_snapshot_directory: Error: Unable to create directory (%s)\n",
|
|
||||||
snapshot->local_location);
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Initialize the metadata file at the top of that directory.
|
|
||||||
* Add 'BASE' and 'PID'
|
|
||||||
*/
|
|
||||||
if( NULL != last_metadata_file ) {
|
|
||||||
free(last_metadata_file);
|
|
||||||
last_metadata_file = NULL;
|
|
||||||
}
|
|
||||||
last_metadata_file = strdup(snapshot->local_location);
|
|
||||||
|
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_BASE, "") ) ) {
|
|
||||||
opal_output(opal_crs_base_output,
|
|
||||||
"opal:crs:base: init_snapshot_directory: Error: Unable to write BASE to the file (%s/%s)\n",
|
|
||||||
snapshot->local_location, opal_crs_base_metadata_filename);
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
asprintf(&pid_str, "%d", getpid());
|
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_PID, pid_str) ) ) {
|
|
||||||
opal_output(opal_crs_base_output,
|
|
||||||
"opal:crs:base: init_snapshot_directory: Error: Unable to write PID (%s) to the file (%s/%s)\n",
|
|
||||||
pid_str, snapshot->local_location, opal_crs_base_metadata_filename);
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
cleanup:
|
|
||||||
if( NULL != pid_str) {
|
|
||||||
free(pid_str);
|
|
||||||
pid_str = NULL;
|
|
||||||
}
|
|
||||||
|
|
||||||
return OPAL_SUCCESS;
|
|
||||||
}
|
|
||||||
|
|
||||||
int opal_crs_base_cleanup_append(char* filename, bool is_dir)
|
int opal_crs_base_cleanup_append(char* filename, bool is_dir)
|
||||||
{
|
{
|
||||||
if( NULL == filename ) {
|
if( NULL == filename ) {
|
||||||
@ -399,6 +297,14 @@ int opal_crs_base_copy_options(opal_crs_base_ckpt_options_t *from,
|
|||||||
to->term = from->term;
|
to->term = from->term;
|
||||||
to->stop = from->stop;
|
to->stop = from->stop;
|
||||||
|
|
||||||
|
to->inc_prep_only = from->inc_prep_only;
|
||||||
|
to->inc_recover_only = from->inc_recover_only;
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
to->attach_debugger = from->attach_debugger;
|
||||||
|
to->detach_debugger = from->detach_debugger;
|
||||||
|
#endif
|
||||||
|
|
||||||
return OPAL_SUCCESS;
|
return OPAL_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -413,6 +319,32 @@ int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target)
|
|||||||
target->term = false;
|
target->term = false;
|
||||||
target->stop = false;
|
target->stop = false;
|
||||||
|
|
||||||
|
target->inc_prep_only = false;
|
||||||
|
target->inc_recover_only = false;
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
target->attach_debugger = false;
|
||||||
|
target->detach_debugger = false;
|
||||||
|
#endif
|
||||||
|
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_crs_base_self_register_checkpoint_callback(opal_crs_base_self_checkpoint_fn_t function)
|
||||||
|
{
|
||||||
|
crs_base_self_checkpoint_fn = function;
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_crs_base_self_register_restart_callback(opal_crs_base_self_restart_fn_t function)
|
||||||
|
{
|
||||||
|
crs_base_self_restart_fn = function;
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_crs_base_self_register_continue_callback(opal_crs_base_self_continue_fn_t function)
|
||||||
|
{
|
||||||
|
crs_base_self_continue_fn = function;
|
||||||
return OPAL_SUCCESS;
|
return OPAL_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -420,38 +352,6 @@ int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target)
|
|||||||
/******************
|
/******************
|
||||||
* Local Functions
|
* Local Functions
|
||||||
******************/
|
******************/
|
||||||
static int opal_crs_base_metadata_open(FILE **meta_data, char * location, char * mode)
|
|
||||||
{
|
|
||||||
int exit_status = OPAL_SUCCESS;
|
|
||||||
char * dir_name = NULL;
|
|
||||||
|
|
||||||
if( NULL == location ) {
|
|
||||||
if( NULL == last_metadata_file ) {
|
|
||||||
opal_output(0, "Error: No metadata filename specified!");
|
|
||||||
exit_status = OPAL_ERROR;
|
|
||||||
goto cleanup;
|
|
||||||
} else {
|
|
||||||
location = last_metadata_file;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Find the snapshot directory, read the metadata file
|
|
||||||
*/
|
|
||||||
asprintf(&dir_name, "%s/%s", location, opal_crs_base_metadata_filename);
|
|
||||||
if (NULL == (*meta_data = fopen(dir_name, mode)) ) {
|
|
||||||
exit_status = OPAL_ERROR;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
cleanup:
|
|
||||||
if( NULL != dir_name ) {
|
|
||||||
free(dir_name);
|
|
||||||
dir_name = NULL;
|
|
||||||
}
|
|
||||||
return exit_status;
|
|
||||||
}
|
|
||||||
|
|
||||||
static int metadata_extract_next_token(FILE *file, char **token, char **value)
|
static int metadata_extract_next_token(FILE *file, char **token, char **value)
|
||||||
{
|
{
|
||||||
int exit_status = OPAL_SUCCESS;
|
int exit_status = OPAL_SUCCESS;
|
||||||
@ -558,12 +458,20 @@ static int metadata_extract_next_token(FILE *file, char **token, char **value)
|
|||||||
*value = strdup(local_value);
|
*value = strdup(local_value);
|
||||||
|
|
||||||
cleanup:
|
cleanup:
|
||||||
if( NULL != local_token)
|
if( NULL != local_token) {
|
||||||
free(local_token);
|
free(local_token);
|
||||||
if( NULL != local_value)
|
local_token = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
if( NULL != local_value) {
|
||||||
free(local_value);
|
free(local_value);
|
||||||
if( NULL != line)
|
local_value = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
if( NULL != line) {
|
||||||
free(line);
|
free(line);
|
||||||
|
line = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
return exit_status;
|
return exit_status;
|
||||||
}
|
}
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2008 The Trustees of Indiana University.
|
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
* All rights reserved.
|
* All rights reserved.
|
||||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||||
* All rights reserved.
|
* All rights reserved.
|
||||||
@ -48,7 +48,6 @@ opal_crs_base_module_t opal_crs = {
|
|||||||
};
|
};
|
||||||
opal_list_t opal_crs_base_components_available;
|
opal_list_t opal_crs_base_components_available;
|
||||||
opal_crs_base_component_t opal_crs_base_selected_component;
|
opal_crs_base_component_t opal_crs_base_selected_component;
|
||||||
char * opal_crs_base_snapshot_dir = NULL;
|
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Function for finding and opening either all MCA components,
|
* Function for finding and opening either all MCA components,
|
||||||
@ -73,14 +72,6 @@ int opal_crs_base_open(void)
|
|||||||
}
|
}
|
||||||
opal_output_set_verbosity(opal_crs_base_output, value);
|
opal_output_set_verbosity(opal_crs_base_output, value);
|
||||||
|
|
||||||
/* Base snapshot directory */
|
|
||||||
mca_base_param_reg_string_name("crs",
|
|
||||||
"base_snapshot_dir",
|
|
||||||
"The base directory to use when storing snapshots",
|
|
||||||
true, false,
|
|
||||||
strdup("/tmp"),
|
|
||||||
&opal_crs_base_snapshot_dir);
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Which CRS component to open
|
* Which CRS component to open
|
||||||
* - NULL or "" = auto-select
|
* - NULL or "" = auto-select
|
||||||
@ -90,7 +81,13 @@ int opal_crs_base_open(void)
|
|||||||
mca_base_param_reg_string_name("crs", NULL,
|
mca_base_param_reg_string_name("crs", NULL,
|
||||||
"Which CRS component to use (empty = auto-select)",
|
"Which CRS component to use (empty = auto-select)",
|
||||||
false, false,
|
false, false,
|
||||||
"none", &str_value);
|
NULL, &str_value);
|
||||||
|
|
||||||
|
if( !opal_cr_is_enabled ) {
|
||||||
|
opal_output_verbose(10, opal_crs_base_output,
|
||||||
|
"crs:open: FT is not enabled, skipping!");
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
/* Open up all available components */
|
/* Open up all available components */
|
||||||
if (OPAL_SUCCESS != (ret = mca_base_components_open("crs",
|
if (OPAL_SUCCESS != (ret = mca_base_components_open("crs",
|
||||||
@ -110,5 +107,6 @@ int opal_crs_base_open(void)
|
|||||||
if( NULL != str_value ) {
|
if( NULL != str_value ) {
|
||||||
free(str_value);
|
free(str_value);
|
||||||
}
|
}
|
||||||
|
|
||||||
return exit_status;
|
return exit_status;
|
||||||
}
|
}
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2008 The Trustees of Indiana University.
|
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
* All rights reserved.
|
* All rights reserved.
|
||||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||||
* All rights reserved.
|
* All rights reserved.
|
||||||
@ -37,6 +37,12 @@ int opal_crs_base_select(void)
|
|||||||
opal_crs_base_module_t *best_module = NULL;
|
opal_crs_base_module_t *best_module = NULL;
|
||||||
int int_value = 0;
|
int int_value = 0;
|
||||||
|
|
||||||
|
if( !opal_cr_is_enabled ) {
|
||||||
|
opal_output_verbose(10, opal_crs_base_output,
|
||||||
|
"crs:select: FT is not enabled, skipping!");
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Note: If we are a tool, then we will manually run the selection routine
|
* Note: If we are a tool, then we will manually run the selection routine
|
||||||
* for the checkpointer. The tool will set the MCA parameter
|
* for the checkpointer. The tool will set the MCA parameter
|
||||||
|
@ -167,6 +167,14 @@ AC_DEFUN([MCA_crs_blcr_CONFIG],[
|
|||||||
[BLCRs cr_checkpoint_info.requester member availability])
|
[BLCRs cr_checkpoint_info.requester member availability])
|
||||||
$1])
|
$1])
|
||||||
|
|
||||||
|
#
|
||||||
|
# Require either a working cr_request_file() or cr_request_checkpoint() function
|
||||||
|
#
|
||||||
|
AS_IF([test "$crs_blcr_have_working_cr_request" = "0" -a "$crs_blcr_have_cr_request_checkpoint" = "0"],
|
||||||
|
[$2
|
||||||
|
check_crs_blcr_good="no"
|
||||||
|
AC_MSG_WARN([The BLCR CRS component requires either the cr_request_checkpoint() or cr_request_file() functions])])
|
||||||
|
|
||||||
#
|
#
|
||||||
# Reset the flags
|
# Reset the flags
|
||||||
#
|
#
|
||||||
|
@ -34,6 +34,7 @@
|
|||||||
|
|
||||||
#include "opal/mca/base/mca_base_param.h"
|
#include "opal/mca/base/mca_base_param.h"
|
||||||
|
|
||||||
|
#include "opal/threads/threads.h"
|
||||||
#include "opal/threads/mutex.h"
|
#include "opal/threads/mutex.h"
|
||||||
#include "opal/threads/condition.h"
|
#include "opal/threads/condition.h"
|
||||||
|
|
||||||
@ -94,20 +95,26 @@ OBJ_CLASS_INSTANCE(opal_crs_blcr_snapshot_t,
|
|||||||
/******************
|
/******************
|
||||||
* Local Functions
|
* Local Functions
|
||||||
******************/
|
******************/
|
||||||
static int blcr_checkpoint_peer(pid_t pid, char * local_dir, char ** fname);
|
|
||||||
static int blcr_get_checkpoint_filename(char **fname, pid_t pid);
|
static int blcr_get_checkpoint_filename(char **fname, pid_t pid);
|
||||||
static int opal_crs_blcr_thread_callback(void *arg);
|
static int opal_crs_blcr_thread_callback(void *arg);
|
||||||
static int opal_crs_blcr_signal_callback(void *arg);
|
static int opal_crs_blcr_signal_callback(void *arg);
|
||||||
|
|
||||||
static int opal_crs_blcr_checkpoint_cmd(pid_t pid, char * local_dir, char **fname, char **cmd);
|
|
||||||
static int opal_crs_blcr_restart_cmd(char *fname, char **cmd);
|
static int opal_crs_blcr_restart_cmd(char *fname, char **cmd);
|
||||||
|
|
||||||
static int blcr_update_snapshot_metadata(opal_crs_blcr_snapshot_t *snapshot);
|
|
||||||
static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot);
|
static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot);
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
static void MPIR_checkpoint_debugger_crs_hook(cr_hook_event_t event);
|
||||||
|
#endif
|
||||||
|
|
||||||
/*************************
|
/*************************
|
||||||
* Local Global Variables
|
* Local Global Variables
|
||||||
*************************/
|
*************************/
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
static opal_thread_t *checkpoint_thread_id = NULL;
|
||||||
|
static bool blcr_crdebug_refreshed_env = false;
|
||||||
|
#endif
|
||||||
|
|
||||||
static cr_client_id_t client_id;
|
static cr_client_id_t client_id;
|
||||||
static cr_callback_id_t cr_thread_callback_id;
|
static cr_callback_id_t cr_thread_callback_id;
|
||||||
static cr_callback_id_t cr_signal_callback_id;
|
static cr_callback_id_t cr_signal_callback_id;
|
||||||
@ -127,8 +134,10 @@ void opal_crs_blcr_construct(opal_crs_blcr_snapshot_t *snapshot) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
void opal_crs_blcr_destruct( opal_crs_blcr_snapshot_t *snapshot) {
|
void opal_crs_blcr_destruct( opal_crs_blcr_snapshot_t *snapshot) {
|
||||||
if(NULL != snapshot->context_filename)
|
if(NULL != snapshot->context_filename) {
|
||||||
free(snapshot->context_filename);
|
free(snapshot->context_filename);
|
||||||
|
snapshot->context_filename = NULL;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
/*****************
|
/*****************
|
||||||
@ -167,6 +176,10 @@ int opal_crs_blcr_module_init(void)
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
blcr_crdebug_refreshed_env = false;
|
||||||
|
#endif
|
||||||
|
|
||||||
blcr_restart_cmd = strdup("cr_restart");
|
blcr_restart_cmd = strdup("cr_restart");
|
||||||
blcr_checkpoint_cmd = strdup("cr_checkpoint");
|
blcr_checkpoint_cmd = strdup("cr_checkpoint");
|
||||||
|
|
||||||
@ -190,6 +203,20 @@ int opal_crs_blcr_module_init(void)
|
|||||||
cr_signal_callback_id = cr_register_callback(opal_crs_blcr_signal_callback,
|
cr_signal_callback_id = cr_register_callback(opal_crs_blcr_signal_callback,
|
||||||
crs_blcr_signal_callback_arg,
|
crs_blcr_signal_callback_arg,
|
||||||
CR_SIGNAL_CONTEXT);
|
CR_SIGNAL_CONTEXT);
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
/*
|
||||||
|
* Checkpoint/restart enabled debugging hooks
|
||||||
|
* "NO_CALLBACKS" -> non-MPI threads
|
||||||
|
* "SIGNAL_CONTEXT" -> MPI threads
|
||||||
|
* "THREAD_CONTEXT" -> BLCR threads
|
||||||
|
*/
|
||||||
|
cr_register_hook(CR_HOOK_CONT_NO_CALLBACKS, MPIR_checkpoint_debugger_crs_hook);
|
||||||
|
cr_register_hook(CR_HOOK_CONT_SIGNAL_CONTEXT, MPIR_checkpoint_debugger_crs_hook);
|
||||||
|
|
||||||
|
cr_register_hook(CR_HOOK_RSTRT_NO_CALLBACKS, MPIR_checkpoint_debugger_crs_hook);
|
||||||
|
cr_register_hook(CR_HOOK_RSTRT_SIGNAL_CONTEXT, MPIR_checkpoint_debugger_crs_hook);
|
||||||
|
#endif
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
@ -262,6 +289,17 @@ int opal_crs_blcr_module_finalize(void)
|
|||||||
cr_replace_callback(cr_thread_callback_id, NULL, NULL, CR_THREAD_CONTEXT);
|
cr_replace_callback(cr_thread_callback_id, NULL, NULL, CR_THREAD_CONTEXT);
|
||||||
/* Unload the signal callback */
|
/* Unload the signal callback */
|
||||||
cr_replace_callback(cr_signal_callback_id, NULL, NULL, CR_SIGNAL_CONTEXT);
|
cr_replace_callback(cr_signal_callback_id, NULL, NULL, CR_SIGNAL_CONTEXT);
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
/*
|
||||||
|
* Checkpoint/restart enabled debugging hooks
|
||||||
|
*/
|
||||||
|
cr_register_hook(CR_HOOK_CONT_NO_CALLBACKS, NULL);
|
||||||
|
cr_register_hook(CR_HOOK_CONT_SIGNAL_CONTEXT, NULL);
|
||||||
|
|
||||||
|
cr_register_hook(CR_HOOK_RSTRT_NO_CALLBACKS, NULL);
|
||||||
|
cr_register_hook(CR_HOOK_RSTRT_SIGNAL_CONTEXT, NULL);
|
||||||
|
#endif
|
||||||
}
|
}
|
||||||
|
|
||||||
/* BLCR does not have a finalization routine */
|
/* BLCR does not have a finalization routine */
|
||||||
@ -275,61 +313,78 @@ int opal_crs_blcr_checkpoint(pid_t pid,
|
|||||||
opal_crs_state_type_t *state)
|
opal_crs_state_type_t *state)
|
||||||
{
|
{
|
||||||
int ret, exit_status = OPAL_SUCCESS;
|
int ret, exit_status = OPAL_SUCCESS;
|
||||||
opal_crs_blcr_snapshot_t *snapshot = OBJ_NEW(opal_crs_blcr_snapshot_t);
|
opal_crs_blcr_snapshot_t *snapshot = NULL;
|
||||||
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1
|
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1
|
||||||
cr_checkpoint_args_t cr_args;
|
cr_checkpoint_args_t cr_args;
|
||||||
static cr_checkpoint_handle_t cr_handle = (cr_checkpoint_handle_t)(-1);
|
static cr_checkpoint_handle_t cr_handle = (cr_checkpoint_handle_t)(-1);
|
||||||
#endif
|
#endif
|
||||||
|
int fd = 0;
|
||||||
|
char *loc_fname = NULL;
|
||||||
|
|
||||||
|
if( pid != my_pid ) {
|
||||||
|
opal_output(0, "crs:blcr: checkpoint(%d, ---): Checkpointing of peers not allowed!", pid);
|
||||||
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
"crs:blcr: checkpoint(%d, ---)", pid);
|
"crs:blcr: checkpoint(%d, ---)", pid);
|
||||||
|
|
||||||
if(NULL != snapshot->super.reference_name)
|
snapshot = (opal_crs_blcr_snapshot_t *)base_snapshot;
|
||||||
free(snapshot->super.reference_name);
|
|
||||||
snapshot->super.reference_name = strdup(base_snapshot->reference_name);
|
|
||||||
|
|
||||||
if(NULL != snapshot->super.local_location)
|
|
||||||
free(snapshot->super.local_location);
|
|
||||||
snapshot->super.local_location = strdup(base_snapshot->local_location);
|
|
||||||
|
|
||||||
if(NULL != snapshot->super.remote_location)
|
|
||||||
free(snapshot->super.remote_location);
|
|
||||||
snapshot->super.remote_location = strdup(base_snapshot->remote_location);
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Update the snapshot metadata
|
* Update the snapshot metadata
|
||||||
*/
|
*/
|
||||||
snapshot->super.component_name = strdup(mca_crs_blcr_component.super.base_version.mca_component_name);
|
snapshot->super.component_name = strdup(mca_crs_blcr_component.super.base_version.mca_component_name);
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_COMP, snapshot->super.component_name) ) ) {
|
blcr_get_checkpoint_filename(&(snapshot->context_filename), pid);
|
||||||
|
|
||||||
|
if( NULL == snapshot->super.metadata ) {
|
||||||
|
if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "a")) ) {
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||||
"crs:blcr: checkpoint(): Error: Unable to write component name to the directory for (%s).",
|
"crs:blcr: checkpoint(): Error: Unable to open the file (%s)",
|
||||||
snapshot->super.reference_name);
|
snapshot->super.metadata_filename);
|
||||||
exit_status = ret;
|
exit_status = OPAL_ERROR;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
}
|
||||||
|
fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_COMP, snapshot->super.component_name);
|
||||||
|
fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_CONTEXT, snapshot->context_filename);
|
||||||
|
|
||||||
|
fclose(snapshot->super.metadata );
|
||||||
|
snapshot->super.metadata = NULL;
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* If we can checkpoint ourselves, do so:
|
* If we can checkpoint ourselves, do so:
|
||||||
* use cr_request_checkpoint() if available, and cr_request_file() if not
|
* use cr_request_checkpoint() if available, and cr_request_file() if not
|
||||||
*/
|
*/
|
||||||
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1 || CRS_BLCR_HAVE_CR_REQUEST == 1
|
|
||||||
if( pid == my_pid ) {
|
|
||||||
char *loc_fname = NULL;
|
|
||||||
|
|
||||||
blcr_get_checkpoint_filename(&(snapshot->context_filename), pid);
|
|
||||||
if( opal_crs_blcr_dev_null ) {
|
if( opal_crs_blcr_dev_null ) {
|
||||||
loc_fname = strdup("/dev/null");
|
loc_fname = strdup("/dev/null");
|
||||||
} else {
|
} else {
|
||||||
asprintf(&loc_fname, "%s/%s", snapshot->super.local_location, snapshot->context_filename);
|
asprintf(&loc_fname, "%s/%s", snapshot->super.snapshot_directory, snapshot->context_filename);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
/* Make sure to identify the checkpointing thread, so that it is not
|
||||||
|
* prevented from requesting the checkpoint after the debugger detaches
|
||||||
|
*/
|
||||||
|
opal_cr_debug_set_current_ckpt_thread_self();
|
||||||
|
checkpoint_thread_id = opal_thread_get_self();
|
||||||
|
blcr_crdebug_refreshed_env = false;
|
||||||
|
|
||||||
|
/* If checkpoint/restart enabled debugging then mark detachment place */
|
||||||
|
if( MPIR_debug_with_checkpoint ) {
|
||||||
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
|
"crs:blcr: checkpoint(): Detaching debugger...");
|
||||||
|
MPIR_checkpoint_debugger_detach();
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
"crs:blcr: checkpoint SELF <%s>",
|
"crs:blcr: checkpoint SELF <%s>",
|
||||||
loc_fname);
|
loc_fname);
|
||||||
|
|
||||||
|
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1 || CRS_BLCR_HAVE_CR_REQUEST == 1
|
||||||
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1
|
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1
|
||||||
{
|
|
||||||
int fd = 0;
|
|
||||||
fd = open(loc_fname,
|
fd = open(loc_fname,
|
||||||
O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE,
|
O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE,
|
||||||
S_IRUSR | S_IWUSR);
|
S_IRUSR | S_IWUSR);
|
||||||
@ -387,7 +442,6 @@ int opal_crs_blcr_checkpoint(pid_t pid,
|
|||||||
|
|
||||||
/* Close the file */
|
/* Close the file */
|
||||||
close(cr_args.cr_fd);
|
close(cr_args.cr_fd);
|
||||||
}
|
|
||||||
#else
|
#else
|
||||||
/* Request a checkpoint be taken of the current process.
|
/* Request a checkpoint be taken of the current process.
|
||||||
* Since we are not guaranteed to finish the checkpoint before this
|
* Since we are not guaranteed to finish the checkpoint before this
|
||||||
@ -399,51 +453,18 @@ int opal_crs_blcr_checkpoint(pid_t pid,
|
|||||||
do {
|
do {
|
||||||
usleep(1000); /* JJH Do we really want to sleep? */
|
usleep(1000); /* JJH Do we really want to sleep? */
|
||||||
} while(CR_STATE_IDLE != cr_status());
|
} while(CR_STATE_IDLE != cr_status());
|
||||||
|
#endif
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
*state = blcr_current_state;
|
*state = blcr_current_state;
|
||||||
free(loc_fname);
|
free(loc_fname);
|
||||||
}
|
|
||||||
/*
|
|
||||||
* Checkpointing another process
|
|
||||||
*/
|
|
||||||
else
|
|
||||||
#endif
|
|
||||||
{
|
|
||||||
ret = blcr_checkpoint_peer(pid, snapshot->super.local_location, &(snapshot->context_filename));
|
|
||||||
|
|
||||||
if(OPAL_SUCCESS != ret) {
|
|
||||||
*state = OPAL_CRS_ERROR;
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d)",
|
|
||||||
pid);
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
*state = blcr_current_state;
|
|
||||||
}
|
|
||||||
|
|
||||||
if(*state == OPAL_CRS_CONTINUE) {
|
|
||||||
/*
|
|
||||||
* Update the metadata file
|
|
||||||
*/
|
|
||||||
if( OPAL_SUCCESS != (ret = blcr_update_snapshot_metadata(snapshot)) ) {
|
|
||||||
*state = OPAL_CRS_ERROR;
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: checkpoint(): Error: Unable to update metadata for snapshot (%s).",
|
|
||||||
snapshot->super.reference_name);
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Return to the caller
|
|
||||||
*/
|
|
||||||
base_snapshot = &(snapshot->super);
|
|
||||||
|
|
||||||
cleanup:
|
cleanup:
|
||||||
|
if( NULL != snapshot->super.metadata ) {
|
||||||
|
fclose(snapshot->super.metadata );
|
||||||
|
snapshot->super.metadata = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
return exit_status;
|
return exit_status;
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -459,7 +480,7 @@ int opal_crs_blcr_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
|
|||||||
snapshot->super = *base_snapshot;
|
snapshot->super = *base_snapshot;
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
"crs:blcr: restart(%s, %d)", snapshot->super.reference_name, spawn_child);
|
"crs:blcr: restart(--, %d)", spawn_child);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* If we need to reconstruct the snapshot,
|
* If we need to reconstruct the snapshot,
|
||||||
@@ -486,10 +507,6 @@ int opal_crs_blcr_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
|
|||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Restart by replacing this process
|
|
||||||
*/
|
|
||||||
/* Need to shutdown the event engine before this.
|
/* Need to shutdown the event engine before this.
|
||||||
* for some reason the BLCR checkpointer and our event engine don't get
|
* for some reason the BLCR checkpointer and our event engine don't get
|
||||||
* along very well.
|
* along very well.
|
||||||
@@ -586,94 +603,6 @@ int opal_crs_blcr_enable_checkpoint(void)
|
|||||||
/*****************************
|
/*****************************
|
||||||
* Local Function Definitions
|
* Local Function Definitions
|
||||||
*****************************/
|
*****************************/
|
||||||
static int blcr_checkpoint_peer(pid_t pid, char * local_dir, char ** fname)
|
|
||||||
{
|
|
||||||
char **cr_argv = NULL;
|
|
||||||
char *cr_cmd = NULL;
|
|
||||||
int ret;
|
|
||||||
pid_t child_pid;
|
|
||||||
int exit_status = OPAL_SUCCESS;
|
|
||||||
int status, child_status;
|
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: checkpoint_peer(%d, --)", pid);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Get the checkpoint command
|
|
||||||
*/
|
|
||||||
if ( OPAL_SUCCESS != (ret = opal_crs_blcr_checkpoint_cmd(pid, local_dir, fname, &cr_cmd)) ) {
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: checkpoint_peer: Failed to generate checkpoint command :(%d):", ret);
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
if ( NULL == (cr_argv = opal_argv_split(cr_cmd, ' ')) ) {
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: checkpoint_peer: Failed to opal_argv_split :(%d):", ret);
|
|
||||||
exit_status = OPAL_ERROR;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Fork a child to do the checkpoint
|
|
||||||
*/
|
|
||||||
blcr_current_state = OPAL_CRS_CHECKPOINT;
|
|
||||||
|
|
||||||
child_pid = fork();
|
|
||||||
|
|
||||||
if(0 == child_pid) {
|
|
||||||
/* Child Process */
|
|
||||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: blcr_checkpoint_peer: exec :(%s, %s):",
|
|
||||||
strdup(blcr_checkpoint_cmd),
|
|
||||||
opal_argv_join(cr_argv, ' '));
|
|
||||||
|
|
||||||
status = execvp(strdup(blcr_checkpoint_cmd), cr_argv);
|
|
||||||
|
|
||||||
if(status < 0) {
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: blcr_checkpoint_peer: Child failed to execute :(%d):", status);
|
|
||||||
}
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: blcr_checkpoint_peer: execvp returned %d", status);
|
|
||||||
}
|
|
||||||
else if(child_pid > 0) {
|
|
||||||
/* Don't waitpid here since we don't really want to restart from inside waitpid ;) */
|
|
||||||
while(OPAL_CRS_RESTART != blcr_current_state &&
|
|
||||||
OPAL_CRS_CONTINUE != blcr_current_state ) {
|
|
||||||
OPAL_THREAD_LOCK(&blcr_lock);
|
|
||||||
opal_condition_wait(&blcr_cond, &blcr_lock);
|
|
||||||
OPAL_THREAD_UNLOCK(&blcr_lock);
|
|
||||||
}
|
|
||||||
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: blcr_checkpoint_peer: Thread finished with status %d", blcr_current_state);
|
|
||||||
|
|
||||||
if(OPAL_CRS_CONTINUE == blcr_current_state) {
|
|
||||||
/* Wait for the child only if we are continuing */
|
|
||||||
if( 0 > waitpid(child_pid, &child_status, 0) ) {
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: blcr_checkpoint_peer: waitpid returned %d", child_status);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
else {
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: blcr_checkpoint_peer: fork failed :(%d):", child_pid);
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Cleanup
|
|
||||||
*/
|
|
||||||
cleanup:
|
|
||||||
if(NULL != cr_cmd)
|
|
||||||
free(cr_cmd);
|
|
||||||
if(NULL != cr_argv)
|
|
||||||
opal_argv_free(cr_argv);
|
|
||||||
|
|
||||||
return exit_status;
|
|
||||||
}
|
|
||||||
|
|
||||||
static int opal_crs_blcr_thread_callback(void *arg) {
|
static int opal_crs_blcr_thread_callback(void *arg) {
|
||||||
const struct cr_checkpoint_info *ckpt_info = cr_get_checkpoint_info();
|
const struct cr_checkpoint_info *ckpt_info = cr_get_checkpoint_info();
|
||||||
int ret;
|
int ret;
|
||||||
@@ -700,6 +629,11 @@ static int opal_crs_blcr_thread_callback(void *arg) {
|
|||||||
else
|
else
|
||||||
#endif
|
#endif
|
||||||
{
|
{
|
||||||
|
if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_CRS_PRE_CKPT,
|
||||||
|
OMPI_CR_INC_STATE_PREPARE)) ) {
|
||||||
|
;
|
||||||
|
}
|
||||||
|
|
||||||
ret = cr_checkpoint(0);
|
ret = cr_checkpoint(0);
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -720,6 +654,13 @@ static int opal_crs_blcr_thread_callback(void *arg) {
|
|||||||
blcr_current_state = OPAL_CRS_CONTINUE;
|
blcr_current_state = OPAL_CRS_CONTINUE;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if( OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_CRS_POST_CKPT,
|
||||||
|
(blcr_current_state == OPAL_CRS_CONTINUE ?
|
||||||
|
OMPI_CR_INC_STATE_CONTINUE :
|
||||||
|
OMPI_CR_INC_STATE_RESTART))) ) {
|
||||||
|
;
|
||||||
|
}
|
||||||
|
|
||||||
OPAL_THREAD_UNLOCK(&blcr_lock);
|
OPAL_THREAD_UNLOCK(&blcr_lock);
|
||||||
opal_condition_signal(&blcr_cond);
|
opal_condition_signal(&blcr_cond);
|
||||||
|
|
||||||
@@ -747,66 +688,6 @@ static int opal_crs_blcr_signal_callback(void *arg) {
|
|||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
|
|
||||||
static int opal_crs_blcr_checkpoint_cmd(pid_t pid, char * local_dir, char **fname, char **cmd)
|
|
||||||
{
|
|
||||||
char **cr_argv = NULL;
|
|
||||||
int argc = 0, ret;
|
|
||||||
char * pid_str;
|
|
||||||
int exit_status = OPAL_SUCCESS;
|
|
||||||
char * loc_fname = NULL;
|
|
||||||
|
|
||||||
blcr_get_checkpoint_filename(fname, pid);
|
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: checkpoint_cmd(%d)", pid);
|
|
||||||
|
|
||||||
asprintf(&loc_fname, "%s/%s", local_dir, *fname);
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Build the command
|
|
||||||
*/
|
|
||||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup(blcr_checkpoint_cmd)))) {
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup("--pid")))) {
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
asprintf(&pid_str, "%d", pid);
|
|
||||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup(pid_str)))) {
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup("--file")))) {
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup(loc_fname)))) {
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
cleanup:
|
|
||||||
if(exit_status != OPAL_SUCCESS)
|
|
||||||
*cmd = NULL;
|
|
||||||
else
|
|
||||||
*cmd = opal_argv_join(cr_argv, ' ');
|
|
||||||
|
|
||||||
if(NULL != pid_str)
|
|
||||||
free(pid_str);
|
|
||||||
if( NULL != cr_argv)
|
|
||||||
opal_argv_free(cr_argv);
|
|
||||||
if(NULL != loc_fname)
|
|
||||||
free(loc_fname);
|
|
||||||
|
|
||||||
return exit_status;
|
|
||||||
}
|
|
||||||
|
|
||||||
static int opal_crs_blcr_restart_cmd(char *fname, char **cmd)
|
static int opal_crs_blcr_restart_cmd(char *fname, char **cmd)
|
||||||
{
|
{
|
||||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
@@ -833,32 +714,6 @@ static int blcr_get_checkpoint_filename(char **fname, pid_t pid)
|
|||||||
return OPAL_SUCCESS;
|
return OPAL_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
static int blcr_update_snapshot_metadata(opal_crs_blcr_snapshot_t *snapshot) {
|
|
||||||
int exit_status = OPAL_SUCCESS;
|
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: update_snapshot_metadata(%s)", snapshot->super.reference_name);
|
|
||||||
|
|
||||||
/* Bozo check to make sure this snapshot is ours */
|
|
||||||
if ( 0 != strncmp(mca_crs_blcr_component.super.base_version.mca_component_name,
|
|
||||||
snapshot->super.component_name,
|
|
||||||
strlen(snapshot->super.component_name)) ) {
|
|
||||||
exit_status = OPAL_ERROR;
|
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
|
||||||
"crs:blcr: blcr_update_snapshot_metadata: Error: This snapshot (%s) is not intended for us (%s)\n",
|
|
||||||
snapshot->super.component_name, mca_crs_blcr_component.super.base_version.mca_component_name);
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Append to the metadata file the context filename
|
|
||||||
*/
|
|
||||||
opal_crs_base_metadata_write_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, snapshot->context_filename);
|
|
||||||
|
|
||||||
cleanup:
|
|
||||||
return exit_status;
|
|
||||||
}
|
|
||||||
|
|
||||||
static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
|
static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
|
||||||
int ret, exit_status = OPAL_SUCCESS;
|
int ret, exit_status = OPAL_SUCCESS;
|
||||||
char **tmp_argv = NULL;
|
char **tmp_argv = NULL;
|
||||||
@@ -866,16 +721,25 @@ static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
|
|||||||
int prev_pid;
|
int prev_pid;
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
"crs:blcr: cold_start(%s)", snapshot->super.reference_name);
|
"crs:blcr: cold_start()");
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Find the snapshot directory, read the metadata file
|
* Find the snapshot directory, read the metadata file
|
||||||
*/
|
*/
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.local_location,
|
if( NULL == snapshot->super.metadata ) {
|
||||||
|
if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "r")) ) {
|
||||||
|
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||||
|
"crs:blcr: checkpoint(): Error: Unable to open the file (%s)",
|
||||||
|
snapshot->super.metadata_filename);
|
||||||
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.metadata,
|
||||||
&component_name, &prev_pid) ) ) {
|
&component_name, &prev_pid) ) ) {
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||||
"crs:blcr: blcr_cold_start: Error: Failed to extract the metadata from the local snapshot (%s). Returned %d.",
|
"crs:blcr: blcr_cold_start: Error: Failed to extract the metadata from the local snapshot (%s). Returned %d.",
|
||||||
snapshot->super.local_location, ret);
|
snapshot->super.metadata_filename, ret);
|
||||||
exit_status = ret;
|
exit_status = ret;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
@@ -895,15 +759,15 @@ static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
|
|||||||
/*
|
/*
|
||||||
* Context Filename
|
* Context Filename
|
||||||
*/
|
*/
|
||||||
opal_crs_base_metadata_read_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, &tmp_argv);
|
opal_crs_base_metadata_read_token(snapshot->super.metadata, CRS_METADATA_CONTEXT, &tmp_argv);
|
||||||
if( NULL == tmp_argv ) {
|
if( NULL == tmp_argv ) {
|
||||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||||
"crs:blcr: blcr_cold_start: Error: Failed to read the %s token from the local checkpoint in %s",
|
"crs:blcr: blcr_cold_start: Error: Failed to read the %s token from the local checkpoint in %s",
|
||||||
CRS_METADATA_CONTEXT, snapshot->super.local_location);
|
CRS_METADATA_CONTEXT, snapshot->super.snapshot_directory);
|
||||||
exit_status = OPAL_ERROR;
|
exit_status = OPAL_ERROR;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
asprintf(&snapshot->context_filename, "%s/%s", snapshot->super.local_location, tmp_argv[0]);
|
asprintf(&snapshot->context_filename, "%s/%s", snapshot->super.snapshot_directory, tmp_argv[0]);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Reset the cold_start flag
|
* Reset the cold_start flag
|
||||||
@@ -916,5 +780,75 @@ static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
|
|||||||
tmp_argv = NULL;
|
tmp_argv = NULL;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if( NULL != snapshot->super.metadata ) {
|
||||||
|
fclose(snapshot->super.metadata);
|
||||||
|
snapshot->super.metadata = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
return exit_status;
|
return exit_status;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
static void MPIR_checkpoint_debugger_crs_hook(cr_hook_event_t event) {
|
||||||
|
opal_thread_t *my_thread_id = NULL;
|
||||||
|
my_thread_id = opal_thread_get_self();
|
||||||
|
|
||||||
|
/* Non-MPI threads */
|
||||||
|
if(event == CR_HOOK_RSTRT_NO_CALLBACKS ) {
|
||||||
|
/* wait for the MPI thread to refresh the environment for us */
|
||||||
|
while(!blcr_crdebug_refreshed_env) {
|
||||||
|
sched_yield();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
/* MPI threads */
|
||||||
|
else if(event == CR_HOOK_RSTRT_SIGNAL_CONTEXT ) {
|
||||||
|
if( opal_thread_self_compare(checkpoint_thread_id) ) {
|
||||||
|
opal_cr_refresh_environ(my_pid);
|
||||||
|
blcr_crdebug_refreshed_env = true;
|
||||||
|
} else {
|
||||||
|
while(!blcr_crdebug_refreshed_env) {
|
||||||
|
sched_yield();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Some debugging output
|
||||||
|
*/
|
||||||
|
/* Non-MPI threads */
|
||||||
|
if( event == CR_HOOK_CONT_NO_CALLBACKS ) {
|
||||||
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
|
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Continue (Non-MPI). (%d)",
|
||||||
|
(int)my_thread_id->t_handle);
|
||||||
|
}
|
||||||
|
else if(event == CR_HOOK_RSTRT_NO_CALLBACKS ) {
|
||||||
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
|
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Restart (Non-MPI). (%d)",
|
||||||
|
(int)my_thread_id->t_handle);
|
||||||
|
}
|
||||||
|
/* MPI Threads */
|
||||||
|
else if( event == CR_HOOK_CONT_SIGNAL_CONTEXT ) {
|
||||||
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
|
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Continue (MPI).");
|
||||||
|
}
|
||||||
|
else if(event == CR_HOOK_RSTRT_SIGNAL_CONTEXT ) {
|
||||||
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
|
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Restart (MPI).");
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Enter the breakpoint function.
|
||||||
|
* If no debugger intends on attaching, then this function is expected to
|
||||||
|
* return immediately.
|
||||||
|
*
|
||||||
|
* If this is an MPI thread then odds are that this is the checkpointing
|
||||||
|
* thread, in which case this function will return immediately allowing
|
||||||
|
* it to prepare the MPI library before signaling to the debugger that
|
||||||
|
* it is safe to attach, if necessary.
|
||||||
|
*/
|
||||||
|
MPIR_checkpoint_debugger_waitpoint();
|
||||||
|
|
||||||
|
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||||
|
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Finished...");
|
||||||
|
}
|
||||||
|
#endif
|
||||||
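The `opal_crs_blcr_thread_callback` hunks above bracket `cr_checkpoint(0)` with two `trigger_user_inc_callback` calls: one with the PREPARE state before the checkpoint, and one with CONTINUE or RESTART afterwards, depending on how the process resumed. In both hunks a failing callback is ignored (the `if` body is an empty statement). The following is only a minimal sketch of that bracketing idea, not the Open MPI implementation. Every `demo_*` / `DEMO_*` name is hypothetical, and the convention that the checkpoint action returns nonzero when resuming from a restart is an assumption made for the sketch.

```c
#include <stddef.h>

/* Hypothetical event and state identifiers, loosely modeled on the
 * OMPI_CR_INC_* names in the hunks; the real values are not shown there. */
typedef enum {
    DEMO_INC_PRE_CKPT = 0,
    DEMO_INC_POST_CKPT,
    DEMO_INC_MAX
} demo_inc_event_t;

typedef enum {
    DEMO_STATE_PREPARE = 0,
    DEMO_STATE_CONTINUE,
    DEMO_STATE_RESTART
} demo_inc_state_t;

typedef int (*demo_inc_cb_t)(demo_inc_event_t event, demo_inc_state_t state);

/* One optional user callback per event. */
static demo_inc_cb_t demo_callbacks[DEMO_INC_MAX];

/* Last state observed for each event by the example callback below. */
static demo_inc_state_t demo_seen[DEMO_INC_MAX];

/* Install cb for event and return the previously installed callback. */
static demo_inc_cb_t demo_register(demo_inc_event_t event, demo_inc_cb_t cb)
{
    demo_inc_cb_t prev;
    if ((int)event < 0 || (int)event >= DEMO_INC_MAX) {
        return NULL;
    }
    prev = demo_callbacks[event];
    demo_callbacks[event] = cb;
    return prev;
}

/* Invoke the callback for event, if any. Returns 0 when nothing is
 * registered, otherwise whatever the callback returns. */
static int demo_trigger(demo_inc_event_t event, demo_inc_state_t state)
{
    if ((int)event < 0 || (int)event >= DEMO_INC_MAX) {
        return -1;
    }
    if (NULL == demo_callbacks[event]) {
        return 0;
    }
    return demo_callbacks[event](event, state);
}

/* Run a checkpoint action between the pre and post callbacks.
 * Assumption for this sketch: do_ckpt() returns nonzero when the process
 * is resuming from a restart and 0 when it simply continues. */
static demo_inc_state_t demo_checkpoint(int (*do_ckpt)(void))
{
    demo_inc_state_t state;

    /* Callback failures are ignored here, mirroring the hunks above. */
    (void) demo_trigger(DEMO_INC_PRE_CKPT, DEMO_STATE_PREPARE);

    state = (0 != do_ckpt()) ? DEMO_STATE_RESTART : DEMO_STATE_CONTINUE;

    (void) demo_trigger(DEMO_INC_POST_CKPT, state);
    return state;
}

/* Example callback: remember which state each event was called with. */
static int demo_record_cb(demo_inc_event_t event, demo_inc_state_t state)
{
    demo_seen[event] = state;
    return 0;
}

/* Example checkpoint actions for the two resume paths. */
static int demo_ckpt_continue(void) { return 0; }
static int demo_ckpt_restart(void)  { return 1; }
```

Registering `demo_record_cb` for both events and calling `demo_checkpoint(demo_ckpt_continue)` leaves PREPARE recorded for the pre event and CONTINUE for the post event. A caller that needs to abort the checkpoint when a callback fails would have to check the return value of `demo_trigger` instead of discarding it.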
|
@@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
* University Research and Technology
|
* University Research and Technology
|
||||||
* Corporation. All rights reserved.
|
* Corporation. All rights reserved.
|
||||||
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||||
@@ -79,6 +79,19 @@ struct opal_crs_base_ckpt_options_1_0_0_t {
|
|||||||
bool term;
|
bool term;
|
||||||
/** Send SIGSTOP after checkpoint */
|
/** Send SIGSTOP after checkpoint */
|
||||||
bool stop;
|
bool stop;
|
||||||
|
|
||||||
|
/** INC Prep Only */
|
||||||
|
bool inc_prep_only;
|
||||||
|
|
||||||
|
/** INC Recover Only */
|
||||||
|
bool inc_recover_only;
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
/** Wait for debugger to attach after checkpoint */
|
||||||
|
bool attach_debugger;
|
||||||
|
/** Do not wait for debugger to reattach after checkpoint */
|
||||||
|
bool detach_debugger;
|
||||||
|
#endif
|
||||||
};
|
};
|
||||||
typedef struct opal_crs_base_ckpt_options_1_0_0_t opal_crs_base_ckpt_options_1_0_0_t;
|
typedef struct opal_crs_base_ckpt_options_1_0_0_t opal_crs_base_ckpt_options_1_0_0_t;
|
||||||
typedef struct opal_crs_base_ckpt_options_1_0_0_t opal_crs_base_ckpt_options_t;
|
typedef struct opal_crs_base_ckpt_options_1_0_0_t opal_crs_base_ckpt_options_t;
|
||||||
@@ -96,12 +109,14 @@ struct opal_crs_base_snapshot_1_0_0_t {
|
|||||||
/** MCA Component name */
|
/** MCA Component name */
|
||||||
char * component_name;
|
char * component_name;
|
||||||
|
|
||||||
/** Unique name of snapshot */
|
/** Metadata filename */
|
||||||
char * reference_name;
|
char * metadata_filename;
|
||||||
|
|
||||||
|
/** Metadata fd */
|
||||||
|
FILE * metadata;
|
||||||
|
|
||||||
/** Absolute path to the snapshot directory */
|
/** Absolute path to the snapshot directory */
|
||||||
char * local_location;
|
char * snapshot_directory;
|
||||||
char * remote_location;
|
|
||||||
|
|
||||||
/** Cold Start:
|
/** Cold Start:
|
||||||
* If we are restarting cold, then we need to recreate this structure
|
* If we are restarting cold, then we need to recreate this structure
|
||||||
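The snapshot structure above replaces `reference_name` with a `metadata_filename` and an open `FILE * metadata` handle, and the component hunks then write tokens with `fprintf(..., "%s%s\n", TOKEN, value)` after opening the handle on first use. The following is a minimal sketch of that open-on-demand, write, close-and-reset cycle, plus a matching reader. It is not the OPAL code: the `demo_*` names and the `DEMO_TOKEN_COMP` prefix are hypothetical, since the actual `CRS_METADATA_*` strings do not appear in this diff.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the two metadata fields added to the snapshot. */
typedef struct {
    const char *metadata_filename;  /* path of the metadata file */
    FILE       *metadata;           /* open handle, or NULL when closed */
} demo_snapshot_t;

/* Hypothetical token prefix; the real CRS_METADATA_* values are not shown. */
#define DEMO_TOKEN_COMP "# COMPONENT: "

/* Open the metadata file on first use (append mode), write one
 * "<token><value>\n" line, then close the handle and reset it to NULL.
 * Returns 0 on success, -1 on failure. */
static int demo_write_token(demo_snapshot_t *snap,
                            const char *token, const char *value)
{
    int rc = 0;

    if (NULL == snap->metadata) {
        snap->metadata = fopen(snap->metadata_filename, "a");
        if (NULL == snap->metadata) {
            return -1;
        }
    }
    if (fprintf(snap->metadata, "%s%s\n", token, value) < 0) {
        rc = -1;
    }
    if (0 != fclose(snap->metadata)) {
        rc = -1;
    }
    snap->metadata = NULL;
    return rc;
}

/* Find the first line starting with token and copy the remainder of that
 * line (without the newline) into out. Returns 0 if found, -1 otherwise. */
static int demo_read_token(const char *filename, const char *token,
                           char *out, size_t out_len)
{
    char line[512];
    size_t tlen = strlen(token);
    int found = -1;
    FILE *fp = fopen(filename, "r");   /* read mode for lookups */

    if (NULL == fp) {
        return -1;
    }
    while (NULL != fgets(line, sizeof(line), fp)) {
        if (0 == strncmp(line, token, tlen)) {
            line[strcspn(line, "\n")] = '\0';
            snprintf(out, out_len, "%s", line + tlen);
            found = 0;
            break;
        }
    }
    fclose(fp);
    return found;
}
```

Resetting the handle to `NULL` after `fclose` is what lets later code use `NULL == snap->metadata` as the "not open" test, which is the check the component hunks rely on.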
|
@@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2009 The Trustees of Indiana University.
|
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||||
* All rights reserved.
|
* All rights reserved.
|
||||||
*
|
*
|
||||||
* $COPYRIGHT$
|
* $COPYRIGHT$
|
||||||
@@ -58,25 +58,25 @@ int opal_crs_none_checkpoint(pid_t pid,
|
|||||||
opal_crs_base_ckpt_options_t *options,
|
opal_crs_base_ckpt_options_t *options,
|
||||||
opal_crs_state_type_t *state)
|
opal_crs_state_type_t *state)
|
||||||
{
|
{
|
||||||
int ret;
|
|
||||||
|
|
||||||
*state = OPAL_CRS_CONTINUE;
|
*state = OPAL_CRS_CONTINUE;
|
||||||
|
|
||||||
snapshot->component_name = strdup("none");
|
snapshot->component_name = strdup("none");
|
||||||
snapshot->reference_name = strdup("none");
|
|
||||||
snapshot->local_location = strdup("");
|
|
||||||
snapshot->remote_location = strdup("");
|
|
||||||
snapshot->cold_start = false;
|
snapshot->cold_start = false;
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Update the snapshot metadata
|
* Update the snapshot metadata
|
||||||
*/
|
*/
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_COMP, "none") ) ) {
|
if( NULL == snapshot->metadata ) {
|
||||||
|
if (NULL == (snapshot->metadata = fopen(snapshot->metadata_filename, "a")) ) {
|
||||||
opal_output(0,
|
opal_output(0,
|
||||||
"crs:none: checkpoint(): Error: Unable to write component name to the directory for (%s).",
|
"crs:none: checkpoint(): Error: Unable to open the file (%s)",
|
||||||
snapshot->reference_name);
|
snapshot->metadata_filename);
|
||||||
return ret;
|
return OPAL_ERROR;
|
||||||
}
|
}
|
||||||
|
}
|
||||||
|
fprintf(snapshot->metadata, "%s%s\n", CRS_METADATA_COMP, snapshot->component_name);
|
||||||
|
fclose(snapshot->metadata);
|
||||||
|
snapshot->metadata = NULL;
|
||||||
|
|
||||||
if( options->stop ) {
|
if( options->stop ) {
|
||||||
opal_output(0,
|
opal_output(0,
|
||||||
@@ -88,28 +88,43 @@ int opal_crs_none_checkpoint(pid_t pid,
|
|||||||
|
|
||||||
int opal_crs_none_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_child, pid_t *child_pid)
|
int opal_crs_none_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_child, pid_t *child_pid)
|
||||||
{
|
{
|
||||||
|
int exit_status = OPAL_SUCCESS;
|
||||||
char **tmp_argv = NULL;
|
char **tmp_argv = NULL;
|
||||||
char **cr_argv = NULL;
|
char **cr_argv = NULL;
|
||||||
int status;
|
int status;
|
||||||
|
|
||||||
*child_pid = getpid();
|
*child_pid = getpid();
|
||||||
|
|
||||||
opal_crs_base_metadata_read_token(base_snapshot->local_location, CRS_METADATA_CONTEXT, &tmp_argv);
|
if( NULL == base_snapshot->metadata ) {
|
||||||
|
if (NULL == (base_snapshot->metadata = fopen(base_snapshot->metadata_filename, "a")) ) {
|
||||||
|
opal_output(0,
|
||||||
|
"crs:none: checkpoint(): Error: Unable to open the file (%s)",
|
||||||
|
base_snapshot->metadata_filename);
|
||||||
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
opal_crs_base_metadata_read_token(base_snapshot->metadata, CRS_METADATA_CONTEXT, &tmp_argv);
|
||||||
|
|
||||||
if( NULL == tmp_argv ) {
|
if( NULL == tmp_argv ) {
|
||||||
opal_output(opal_crs_base_output,
|
opal_output(opal_crs_base_output,
|
||||||
"crs:none: none_restart: Error: Failed to read the %s token from the local checkpoint in %s",
|
"crs:none: none_restart: Error: Failed to read the %s token from the local checkpoint in %s",
|
||||||
CRS_METADATA_CONTEXT, base_snapshot->local_location);
|
CRS_METADATA_CONTEXT, base_snapshot->metadata_filename);
|
||||||
return OPAL_ERROR;
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
if( opal_argv_count(tmp_argv) <= 0 ) {
|
if( opal_argv_count(tmp_argv) <= 0 ) {
|
||||||
opal_output_verbose(10, opal_crs_base_output,
|
opal_output_verbose(10, opal_crs_base_output,
|
||||||
"crs:none: none_restart: No command line to exec, so just returning");
|
"crs:none: none_restart: No command line to exec, so just returning");
|
||||||
return OPAL_SUCCESS;
|
exit_status = OPAL_SUCCESS;
|
||||||
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
if ( NULL == (cr_argv = opal_argv_split(tmp_argv[0], ' ')) ) {
|
if ( NULL == (cr_argv = opal_argv_split(tmp_argv[0], ' ')) ) {
|
||||||
return OPAL_ERROR;
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
if( !spawn_child ) {
|
if( !spawn_child ) {
|
||||||
@@ -126,14 +141,20 @@ int opal_crs_none_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
|
|||||||
}
|
}
|
||||||
opal_output(opal_crs_base_output,
|
opal_output(opal_crs_base_output,
|
||||||
"crs:none: none_restart: execvp returned %d", status);
|
"crs:none: none_restart: execvp returned %d", status);
|
||||||
return status;
|
exit_status = status;
|
||||||
|
goto cleanup;
|
||||||
} else {
|
} else {
|
||||||
opal_output(opal_crs_base_output,
|
opal_output(opal_crs_base_output,
|
||||||
"crs:none: none_restart: Spawn not implemented");
|
"crs:none: none_restart: Spawn not implemented");
|
||||||
return OPAL_ERR_NOT_IMPLEMENTED;
|
exit_status = OPAL_ERR_NOT_IMPLEMENTED;
|
||||||
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
return OPAL_SUCCESS;
|
cleanup:
|
||||||
|
fclose(base_snapshot->metadata);
|
||||||
|
base_snapshot->metadata = NULL;
|
||||||
|
|
||||||
|
return exit_status;
|
||||||
}
|
}
|
||||||
|
|
||||||
int opal_crs_none_disable_checkpoint(void)
|
int opal_crs_none_disable_checkpoint(void)
|
||||||
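The `opal_crs_none_restart` hunks above convert every early `return` into `exit_status = ...; goto cleanup;` so that the metadata handle is released on all paths. The following is a minimal sketch of that single-exit idiom under two stated assumptions: the handle can still be `NULL` when `cleanup` is reached (so the close is guarded, as the blcr hunks do), and a lookup wants the file opened for reading. The `demo_*` / `DEMO_*` names are hypothetical and not part of OPAL.

```c
#include <stdio.h>
#include <string.h>

#define DEMO_SUCCESS 0
#define DEMO_ERROR   (-1)

/* Copy the first line of filename into out, without the trailing newline.
 * Every failure path funnels through the cleanup label, so the file handle
 * is closed exactly once no matter where the function bails out. */
static int demo_read_first_line(const char *filename, char *out, size_t out_len)
{
    int exit_status = DEMO_SUCCESS;
    FILE *fp = NULL;

    if (NULL == filename || NULL == out || 0 == out_len) {
        exit_status = DEMO_ERROR;
        goto cleanup;
    }

    if (NULL == (fp = fopen(filename, "r"))) {
        exit_status = DEMO_ERROR;
        goto cleanup;               /* fp is still NULL here */
    }

    if (NULL == fgets(out, (int)out_len, fp)) {
        exit_status = DEMO_ERROR;
        goto cleanup;               /* empty file or read error */
    }
    out[strcspn(out, "\n")] = '\0';

 cleanup:
    /* Guard the close: some paths reach this label before fopen succeeded. */
    if (NULL != fp) {
        fclose(fp);
        fp = NULL;
    }
    return exit_status;
}
```

The guard matters because `fclose(NULL)` is undefined behavior in C, and the earliest `goto cleanup` paths run before any handle exists.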
|
@@ -1,5 +1,5 @@
|
|||||||
.\"
|
.\"
|
||||||
.\" Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
.\" Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
.\" University Research and Technology
|
.\" University Research and Technology
|
||||||
.\" Corporation. All rights reserved.
|
.\" Corporation. All rights reserved.
|
||||||
.\" Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
|
.\" Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
|
||||||
@@ -89,10 +89,6 @@ The following MCA parameters apply to all components:
|
|||||||
crs_base_verbose
|
crs_base_verbose
|
||||||
Set the verbosity level for all components. Default is 0, or silent except on error.
|
Set the verbosity level for all components. Default is 0, or silent except on error.
|
||||||
.
|
.
|
||||||
.TP
|
|
||||||
crs_base_snapshot_dir
|
|
||||||
The directory to store the checkpoint snapshots. Default is \fB/tmp\fP.
|
|
||||||
.
|
|
||||||
.\" Self Component
|
.\" Self Component
|
||||||
.\" ******************
|
.\" ******************
|
||||||
.SS self CRS Component
|
.SS self CRS Component
|
||||||
|
@@ -285,17 +285,11 @@ int opal_crs_self_checkpoint(pid_t pid,
|
|||||||
/*
|
/*
|
||||||
* Setup for snapshot directory creation
|
* Setup for snapshot directory creation
|
||||||
*/
|
*/
|
||||||
if(NULL != snapshot->super.reference_name)
|
snapshot->super = *base_snapshot;
|
||||||
free(snapshot->super.reference_name);
|
#if 0
|
||||||
snapshot->super.reference_name = strdup(base_snapshot->reference_name);
|
snapshot->super.snapshot_directory = strdup(base_snapshot->snapshot_directory);
|
||||||
|
snapshot->super.metadata_filename = strdup(base_snapshot->metadata_filename);
|
||||||
if(NULL != snapshot->super.local_location)
|
#endif
|
||||||
free(snapshot->super.local_location);
|
|
||||||
snapshot->super.local_location = strdup(base_snapshot->local_location);
|
|
||||||
|
|
||||||
if(NULL != snapshot->super.remote_location)
|
|
||||||
free(snapshot->super.remote_location);
|
|
||||||
snapshot->super.remote_location = strdup(base_snapshot->remote_location);
|
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
||||||
"crs:self: checkpoint(%d, ---)", pid);
|
"crs:self: checkpoint(%d, ---)", pid);
|
||||||
@@ -310,13 +304,16 @@ int opal_crs_self_checkpoint(pid_t pid,
|
|||||||
* Update the snapshot metadata
|
* Update the snapshot metadata
|
||||||
*/
|
*/
|
||||||
snapshot->super.component_name = strdup(mca_crs_self_component.super.base_version.mca_component_name);
|
snapshot->super.component_name = strdup(mca_crs_self_component.super.base_version.mca_component_name);
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_COMP, snapshot->super.component_name) ) ) {
|
if( NULL == snapshot->super.metadata ) {
|
||||||
|
if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "a")) ) {
|
||||||
opal_output(mca_crs_self_component.super.output_handle,
|
opal_output(mca_crs_self_component.super.output_handle,
|
||||||
"crs:self: checkpoint(): Error: Unable to write component name to the directory for (%s).",
|
"crs:self: checkpoint(): Error: Unable to open the file (%s)",
|
||||||
snapshot->super.reference_name);
|
snapshot->super.metadata_filename);
|
||||||
exit_status = ret;
|
exit_status = OPAL_ERROR;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
}
|
||||||
|
fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_COMP, snapshot->super.component_name);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Call the user callback function
|
* Call the user callback function
|
||||||
@@ -350,7 +347,7 @@ int opal_crs_self_checkpoint(pid_t pid,
|
|||||||
*state = OPAL_CRS_ERROR;
|
*state = OPAL_CRS_ERROR;
|
||||||
opal_output(mca_crs_self_component.super.output_handle,
|
opal_output(mca_crs_self_component.super.output_handle,
|
||||||
"crs:self: checkpoint(): Error: Unable to update metadata for snapshot (%s).",
|
"crs:self: checkpoint(): Error: Unable to update metadata for snapshot (%s).",
|
||||||
snapshot->super.reference_name);
|
snapshot->super.metadata_filename);
|
||||||
exit_status = ret;
|
exit_status = ret;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
@@ -392,7 +389,7 @@ int opal_crs_self_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
|
|||||||
snapshot->super = *base_snapshot;
|
snapshot->super = *base_snapshot;
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
||||||
"crs:self: restart(%s, %d)", snapshot->super.reference_name, spawn_child);
|
"crs:self: restart(%d)", spawn_child);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* If we need to reconstruct the snapshot
|
* If we need to reconstruct the snapshot
|
||||||
@@ -675,16 +672,25 @@ static int self_cold_start(opal_crs_self_snapshot_t *snapshot) {
|
|||||||
int prev_pid;
|
int prev_pid;
|
||||||
|
|
||||||
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
||||||
"crs:self: cold_start(%s)", snapshot->super.reference_name);
|
"crs:self: cold_start()");
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Find the snapshot directory, read the metadata file
|
* Find the snapshot directory, read the metadata file
|
||||||
*/
|
*/
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.local_location,
|
if( NULL == snapshot->super.metadata ) {
|
||||||
|
if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "a")) ) {
|
||||||
|
opal_output(mca_crs_self_component.super.output_handle,
|
||||||
|
"crs:self: checkpoint(): Error: Unable to open the file (%s)",
|
||||||
|
snapshot->super.metadata_filename);
|
||||||
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.metadata,
|
||||||
&component_name, &prev_pid) ) ) {
|
&component_name, &prev_pid) ) ) {
|
||||||
opal_output(mca_crs_self_component.super.output_handle,
|
opal_output(mca_crs_self_component.super.output_handle,
|
||||||
"crs:self: self_cold_start: Error: Failed to extract the metadata from the local snapshot (%s). Returned %d.",
|
"crs:self: self_cold_start: Error: Failed to extract the metadata from the local snapshot (%s). Returned %d.",
|
||||||
snapshot->super.local_location, ret);
|
snapshot->super.metadata_filename, ret);
|
||||||
exit_status = ret;
|
exit_status = ret;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
@@ -705,11 +711,11 @@ static int self_cold_start(opal_crs_self_snapshot_t *snapshot) {
|
|||||||
* Restart command
|
* Restart command
|
||||||
* JJH: Command lines limited to 256 chars.
|
* JJH: Command lines limited to 256 chars.
|
||||||
*/
|
*/
|
||||||
opal_crs_base_metadata_read_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, &tmp_argv);
|
opal_crs_base_metadata_read_token(snapshot->super.metadata, CRS_METADATA_CONTEXT, &tmp_argv);
|
||||||
if( NULL == tmp_argv ) {
|
if( NULL == tmp_argv ) {
|
||||||
opal_output(mca_crs_self_component.super.output_handle,
|
opal_output(mca_crs_self_component.super.output_handle,
|
||||||
"crs:self: self_cold_start: Error: Failed to read the %s token from the local checkpoint in %s",
|
"crs:self: self_cold_start: Error: Failed to read the %s token from the local checkpoint in %s",
|
||||||
CRS_METADATA_CONTEXT, snapshot->super.local_location);
|
CRS_METADATA_CONTEXT, snapshot->super.snapshot_directory);
|
||||||
exit_status = OPAL_ERROR;
|
exit_status = OPAL_ERROR;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
@ -742,13 +748,13 @@ static int self_update_snapshot_metadata(opal_crs_self_snapshot_t *snapshot) {
|
|||||||
|
|
||||||
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
||||||
"crs:self: update_snapshot_metadata(%s)",
|
"crs:self: update_snapshot_metadata(%s)",
|
||||||
snapshot->super.reference_name);
|
snapshot->super.metadata_filename);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Append to the metadata file the command line to restart with
|
* Append to the metadata file the command line to restart with
|
||||||
* - How user wants us to restart
|
* - How user wants us to restart
|
||||||
*/
|
*/
|
||||||
opal_crs_base_metadata_write_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, snapshot->cmd_line);
|
fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_CONTEXT, snapshot->cmd_line);
|
||||||
|
|
||||||
cleanup:
|
cleanup:
|
||||||
return exit_status;
|
return exit_status;
|
||||||
|
@ -74,9 +74,21 @@
|
|||||||
/******************
|
/******************
|
||||||
* Global Var Decls
|
* Global Var Decls
|
||||||
******************/
|
******************/
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
static opal_thread_t **opal_cr_debug_free_threads = NULL;
|
||||||
|
static int opal_cr_debug_num_free_threads = 0;
|
||||||
|
static int opal_cr_debug_threads_already_waiting = false;
|
||||||
|
|
||||||
|
int MPIR_debug_with_checkpoint = 0;
|
||||||
|
static volatile int MPIR_checkpoint_debug_gate = 0;
|
||||||
|
|
||||||
|
int opal_cr_debug_signal = 0;
|
||||||
|
#endif
|
||||||
|
|
||||||
bool opal_cr_stall_check = false;
|
bool opal_cr_stall_check = false;
|
||||||
bool opal_cr_currently_stalled = false;
|
bool opal_cr_currently_stalled = false;
|
||||||
int opal_cr_output;
|
int opal_cr_output;
|
||||||
|
int opal_cr_initalized = 0;
|
||||||
|
|
||||||
static double opal_cr_get_time(void);
|
static double opal_cr_get_time(void);
|
||||||
static void display_indv_timer_core(double diff, char *str);
|
static void display_indv_timer_core(double diff, char *str);
|
||||||
@ -89,10 +101,11 @@ int opal_cr_timing_target_rank = 0;
|
|||||||
/******************
|
/******************
|
||||||
* Local Functions & Var Decls
|
* Local Functions & Var Decls
|
||||||
******************/
|
******************/
|
||||||
static int extract_env_vars(int prev_pid);
|
static int extract_env_vars(int prev_pid, char * file_name);
|
||||||
|
|
||||||
static void opal_cr_sigpipe_debug_signal_handler (int signo);
|
static void opal_cr_sigpipe_debug_signal_handler (int signo);
|
||||||
|
|
||||||
|
static opal_cr_user_inc_callback_fn_t cur_user_coord_callback[OMPI_CR_INC_MAX] = {NULL};
|
||||||
static opal_cr_coord_callback_fn_t cur_coord_callback = NULL;
|
static opal_cr_coord_callback_fn_t cur_coord_callback = NULL;
|
||||||
static opal_cr_notify_callback_fn_t cur_notify_callback = NULL;
|
static opal_cr_notify_callback_fn_t cur_notify_callback = NULL;
|
||||||
|
|
||||||
@ -179,13 +192,11 @@ int opal_cr_set_enabled(bool en)
|
|||||||
return OPAL_SUCCESS;
|
return OPAL_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
int opal_cr_initalized = 0;
|
|
||||||
|
|
||||||
int opal_cr_init(void )
|
int opal_cr_init(void )
|
||||||
{
|
{
|
||||||
int ret, exit_status = OPAL_SUCCESS;
|
int ret, exit_status = OPAL_SUCCESS;
|
||||||
opal_cr_coord_callback_fn_t prev_coord_func;
|
opal_cr_coord_callback_fn_t prev_coord_func;
|
||||||
int val;
|
int val, t;
|
||||||
|
|
||||||
if( ++opal_cr_initalized != 1 ) {
|
if( ++opal_cr_initalized != 1 ) {
|
||||||
if( opal_cr_initalized < 1 ) {
|
if( opal_cr_initalized < 1 ) {
|
||||||
@ -265,9 +276,9 @@ int opal_cr_init(void )
|
|||||||
opal_cr_thread_sleep_check = val;
|
opal_cr_thread_sleep_check = val;
|
||||||
|
|
||||||
mca_base_param_reg_int_name("opal_cr", "thread_sleep_wait",
|
mca_base_param_reg_int_name("opal_cr", "thread_sleep_wait",
|
||||||
"Time to sleep waiting for process to exit MPI library (Default: 0)",
|
"Time to sleep waiting for process to exit MPI library (Default: 1000)",
|
||||||
false, false,
|
false, false,
|
||||||
0, &val);
|
1000, &val);
|
||||||
opal_cr_thread_sleep_wait = val;
|
opal_cr_thread_sleep_wait = val;
|
||||||
|
|
||||||
opal_output_verbose(10, opal_cr_output,
|
opal_output_verbose(10, opal_cr_output,
|
||||||
@ -285,6 +296,19 @@ int opal_cr_init(void )
|
|||||||
opal_output_verbose(10, opal_cr_output,
|
opal_output_verbose(10, opal_cr_output,
|
||||||
"opal_cr: init: Is a tool program: %d",
|
"opal_cr: init: Is a tool program: %d",
|
||||||
val);
|
val);
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
mca_base_param_reg_int_name("opal_cr", "enable_crdebug",
|
||||||
|
"Enable checkpoint/restart debugging",
|
||||||
|
false, false,
|
||||||
|
0,
|
||||||
|
&val);
|
||||||
|
MPIR_debug_with_checkpoint = OPAL_INT_TO_BOOL(val);
|
||||||
|
|
||||||
|
opal_output_verbose(10, opal_cr_output,
|
||||||
|
"opal_cr: init: C/R Debugging Enabled [%s]\n",
|
||||||
|
(MPIR_debug_with_checkpoint ? "True": "False"));
|
||||||
|
#endif
|
||||||
|
|
||||||
#ifndef __WINDOWS__
|
#ifndef __WINDOWS__
|
||||||
mca_base_param_reg_int_name("opal_cr", "signal",
|
mca_base_param_reg_int_name("opal_cr", "signal",
|
||||||
"Checkpoint/Restart signal used to initialize an OPAL Only checkpoint of a program",
|
"Checkpoint/Restart signal used to initialize an OPAL Only checkpoint of a program",
|
||||||
@ -327,10 +351,36 @@ int opal_cr_init(void )
|
|||||||
opal_cr_is_tool = true; /* no support for CR on Windows yet */
|
opal_cr_is_tool = true; /* no support for CR on Windows yet */
|
||||||
#endif /* __WINDOWS__ */
|
#endif /* __WINDOWS__ */
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
opal_cr_debug_num_free_threads = 3;
|
||||||
|
opal_cr_debug_free_threads = (opal_thread_t **)malloc(sizeof(opal_thread_t *) * opal_cr_debug_num_free_threads );
|
||||||
|
for(t = 0; t < opal_cr_debug_num_free_threads; ++t ) {
|
||||||
|
opal_cr_debug_free_threads[t] = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
mca_base_param_reg_int_name("opal_cr", "crdebug_signal",
|
||||||
|
"Checkpoint/Restart signal used to hold threads when debugging",
|
||||||
|
false, false,
|
||||||
|
SIGTSTP,
|
||||||
|
&opal_cr_debug_signal);
|
||||||
|
|
||||||
|
opal_output_verbose(10, opal_cr_output,
|
||||||
|
"opal_cr: init: Checkpoint Signal (Debug): %d",
|
||||||
|
opal_cr_debug_signal);
|
||||||
|
if( SIG_ERR == signal(opal_cr_debug_signal, MPIR_checkpoint_debugger_signal_handler) ) {
|
||||||
|
opal_output(opal_cr_output,
|
||||||
|
"opal_cr: init: Failed to register C/R debug signal (%d)",
|
||||||
|
opal_cr_debug_signal);
|
||||||
|
}
|
||||||
|
#else
|
||||||
|
/* Silence a compiler warning */
|
||||||
|
t = 0;
|
||||||
|
#endif
|
||||||
|
|
||||||
mca_base_param_reg_string_name("opal_cr", "tmp_dir",
|
mca_base_param_reg_string_name("opal_cr", "tmp_dir",
|
||||||
"Temporary directory to place rendezvous files for a checkpoint",
|
"Temporary directory to place rendezvous files for a checkpoint",
|
||||||
false, false,
|
false, false,
|
||||||
"/tmp",
|
opal_tmp_directory(),
|
||||||
&opal_cr_pipe_dir);
|
&opal_cr_pipe_dir);
|
||||||
|
|
||||||
opal_output_verbose(10, opal_cr_output,
|
opal_output_verbose(10, opal_cr_output,
|
||||||
@ -436,6 +486,14 @@ int opal_cr_finalize(void)
|
|||||||
opal_cr_checkpoint_request = OPAL_CR_STATUS_TERM;
|
opal_cr_checkpoint_request = OPAL_CR_STATUS_TERM;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
if( NULL != opal_cr_debug_free_threads ) {
|
||||||
|
free( opal_cr_debug_free_threads );
|
||||||
|
opal_cr_debug_free_threads = NULL;
|
||||||
|
}
|
||||||
|
opal_cr_debug_num_free_threads = 0;
|
||||||
|
#endif
|
||||||
|
|
||||||
if (NULL != opal_cr_pipe_dir) {
|
if (NULL != opal_cr_pipe_dir) {
|
||||||
free(opal_cr_pipe_dir);
|
free(opal_cr_pipe_dir);
|
||||||
opal_cr_pipe_dir = NULL;
|
opal_cr_pipe_dir = NULL;
|
||||||
@ -523,6 +581,14 @@ int opal_cr_inc_core_prep(void)
|
|||||||
{
|
{
|
||||||
int ret;
|
int ret;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Call User Level INC
|
||||||
|
*/
|
||||||
|
if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_PRE_CRS_PRE_MPI,
|
||||||
|
OMPI_CR_INC_STATE_PREPARE)) ) {
|
||||||
|
return ret;
|
||||||
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Use the registered coordination routine
|
* Use the registered coordination routine
|
||||||
*/
|
*/
|
||||||
@ -535,6 +601,14 @@ int opal_cr_inc_core_prep(void)
|
|||||||
return ret;
|
return ret;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Call User Level INC
|
||||||
|
*/
|
||||||
|
if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_PRE_CRS_POST_MPI,
|
||||||
|
OMPI_CR_INC_STATE_PREPARE)) ) {
|
||||||
|
return ret;
|
||||||
|
}
|
||||||
|
|
||||||
core_prev_pid = getpid();
|
core_prev_pid = getpid();
|
||||||
|
|
||||||
return OPAL_SUCCESS;
|
return OPAL_SUCCESS;
|
||||||
@ -575,7 +649,7 @@ int opal_cr_inc_core_ckpt(pid_t pid,
|
|||||||
* If restarting read environment stuff that opal-restart left us.
|
* If restarting read environment stuff that opal-restart left us.
|
||||||
*/
|
*/
|
||||||
if(*state == OPAL_CRS_RESTART) {
|
if(*state == OPAL_CRS_RESTART) {
|
||||||
extract_env_vars(core_prev_pid);
|
opal_cr_refresh_environ(core_prev_pid);
|
||||||
opal_cr_checkpointing_state = OPAL_CR_STATUS_RESTART_PRE;
|
opal_cr_checkpointing_state = OPAL_CR_STATUS_RESTART_PRE;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -585,6 +659,7 @@ int opal_cr_inc_core_ckpt(pid_t pid,
|
|||||||
int opal_cr_inc_core_recover(int state)
|
int opal_cr_inc_core_recover(int state)
|
||||||
{
|
{
|
||||||
int ret;
|
int ret;
|
||||||
|
opal_cr_user_inc_callback_state_t cb_state;
|
||||||
|
|
||||||
if( opal_cr_checkpointing_state != OPAL_CR_STATUS_TERM &&
|
if( opal_cr_checkpointing_state != OPAL_CR_STATUS_TERM &&
|
||||||
opal_cr_checkpointing_state != OPAL_CR_STATUS_CONTINUE &&
|
opal_cr_checkpointing_state != OPAL_CR_STATUS_CONTINUE &&
|
||||||
@ -599,11 +674,29 @@ int opal_cr_inc_core_recover(int state)
|
|||||||
* If restarting read environment stuff that opal-restart left us.
|
* If restarting read environment stuff that opal-restart left us.
|
||||||
*/
|
*/
|
||||||
else if(state == OPAL_CRS_RESTART) {
|
else if(state == OPAL_CRS_RESTART) {
|
||||||
extract_env_vars(core_prev_pid);
|
opal_cr_refresh_environ(core_prev_pid);
|
||||||
opal_cr_checkpointing_state = OPAL_CR_STATUS_RESTART_PRE;
|
opal_cr_checkpointing_state = OPAL_CR_STATUS_RESTART_PRE;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Call User Level INC
|
||||||
|
*/
|
||||||
|
if( OPAL_CRS_CONTINUE == state ) {
|
||||||
|
cb_state = OMPI_CR_INC_STATE_CONTINUE;
|
||||||
|
}
|
||||||
|
else if( OPAL_CRS_RESTART == state ) {
|
||||||
|
cb_state = OMPI_CR_INC_STATE_RESTART;
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
cb_state = OMPI_CR_INC_STATE_ERROR;
|
||||||
|
}
|
||||||
|
|
||||||
|
if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_POST_CRS_PRE_MPI,
|
||||||
|
cb_state)) ) {
|
||||||
|
return ret;
|
||||||
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Use the registered coordination routine
|
* Use the registered coordination routine
|
||||||
*/
|
*/
|
||||||
@ -616,6 +709,15 @@ int opal_cr_inc_core_recover(int state)
|
|||||||
return ret;
|
return ret;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_POST_CRS_POST_MPI,
|
||||||
|
cb_state)) ) {
|
||||||
|
return ret;
|
||||||
|
}
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
opal_cr_debug_clear_current_ckpt_thread();
|
||||||
|
#endif
|
||||||
|
|
||||||
return OPAL_SUCCESS;
|
return OPAL_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -717,6 +819,39 @@ int opal_cr_reg_notify_callback(opal_cr_notify_callback_fn_t new_func,
|
|||||||
return OPAL_SUCCESS;
|
return OPAL_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
int opal_cr_user_inc_register_callback(opal_cr_user_inc_callback_event_t event,
|
||||||
|
opal_cr_user_inc_callback_fn_t function,
|
||||||
|
opal_cr_user_inc_callback_fn_t *prev_function)
|
||||||
|
{
|
||||||
|
if( event < 0 || event >= OMPI_CR_INC_MAX ) {
|
||||||
|
return OPAL_ERROR;
|
||||||
|
}
|
||||||
|
|
||||||
|
if( NULL != cur_user_coord_callback[event] ) {
|
||||||
|
*prev_function = cur_user_coord_callback[event];
|
||||||
|
} else {
|
||||||
|
*prev_function = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
cur_user_coord_callback[event] = function;
|
||||||
|
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int trigger_user_inc_callback(opal_cr_user_inc_callback_event_t event,
|
||||||
|
opal_cr_user_inc_callback_state_t state)
|
||||||
|
{
|
||||||
|
if( NULL == cur_user_coord_callback[event] ) {
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
if( event < 0 || event >= OMPI_CR_INC_MAX ) {
|
||||||
|
return OPAL_ERROR;
|
||||||
|
}
|
||||||
|
|
||||||
|
return ((cur_user_coord_callback[event])(event, state));
|
||||||
|
}
|
||||||
|
|
||||||
int opal_cr_reg_coord_callback(opal_cr_coord_callback_fn_t new_func,
|
int opal_cr_reg_coord_callback(opal_cr_coord_callback_fn_t new_func,
|
||||||
opal_cr_coord_callback_fn_t *prev_func)
|
opal_cr_coord_callback_fn_t *prev_func)
|
||||||
{
|
{
|
||||||
@ -738,14 +873,61 @@ int opal_cr_reg_coord_callback(opal_cr_coord_callback_fn_t new_func,
|
|||||||
return OPAL_SUCCESS;
|
return OPAL_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
int opal_cr_refresh_environ(int prev_pid) {
|
||||||
|
int val;
|
||||||
|
char *file_name = NULL;
|
||||||
|
struct stat file_status;
|
||||||
|
|
||||||
|
if( 0 >= prev_pid ) {
|
||||||
|
prev_pid = getpid();
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Make sure the file exists. If it doesn't then this means 2 things:
|
||||||
|
* 1) We have already executed this function, and
|
||||||
|
* 2) The file has been deleted on the previous round.
|
||||||
|
*/
|
||||||
|
asprintf(&file_name, "%s/%s-%d", opal_tmp_directory(), OPAL_CR_BASE_ENV_NAME, prev_pid);
|
||||||
|
if(0 != stat(file_name, &file_status) ){
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
opal_unsetenv(mca_base_param_env_var("opal_cr_enable_crdebug"), &environ);
|
||||||
|
#endif
|
||||||
|
|
||||||
|
extract_env_vars(prev_pid, file_name);
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
mca_base_param_reg_int_name("opal_cr", "enable_crdebug",
|
||||||
|
"Enable checkpoint/restart debugging",
|
||||||
|
false, false,
|
||||||
|
0,
|
||||||
|
&val);
|
||||||
|
MPIR_debug_with_checkpoint = OPAL_INT_TO_BOOL(val);
|
||||||
|
|
||||||
|
opal_output_verbose(10, opal_cr_output,
|
||||||
|
"opal_cr: init: C/R Debugging Enabled [%s] (refresh)\n",
|
||||||
|
(MPIR_debug_with_checkpoint ? "True": "False"));
|
||||||
|
#else
|
||||||
|
val = 0; /* Silence Compiler warning */
|
||||||
|
#endif
|
||||||
|
|
||||||
|
if( NULL != file_name ){
|
||||||
|
free(file_name);
|
||||||
|
file_name = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Extract environment variables from a saved file
|
* Extract environment variables from a saved file
|
||||||
* and place them in the environment.
|
* and place them in the environment.
|
||||||
*/
|
*/
|
||||||
static int extract_env_vars(int prev_pid)
|
static int extract_env_vars(int prev_pid, char * file_name)
|
||||||
{
|
{
|
||||||
int exit_status = OPAL_SUCCESS;
|
int exit_status = OPAL_SUCCESS;
|
||||||
char *file_name = NULL;
|
|
||||||
FILE *env_data = NULL;
|
FILE *env_data = NULL;
|
||||||
int len = OPAL_PATH_MAX;
|
int len = OPAL_PATH_MAX;
|
||||||
char * tmp_str = NULL;
|
char * tmp_str = NULL;
|
||||||
@ -758,12 +940,6 @@ static int extract_env_vars(int prev_pid)
|
|||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
|
||||||
* JJH: Hardcode /tmp here, really only need an agreed upon file to
|
|
||||||
* transfer the environment variables.
|
|
||||||
*/
|
|
||||||
asprintf(&file_name, "/tmp/%s-%d", OPAL_CR_BASE_ENV_NAME, prev_pid);
|
|
||||||
|
|
||||||
if (NULL == (env_data = fopen(file_name, "r")) ) {
|
if (NULL == (env_data = fopen(file_name, "r")) ) {
|
||||||
exit_status = OPAL_ERROR;
|
exit_status = OPAL_ERROR;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
@ -805,17 +981,12 @@ static int extract_env_vars(int prev_pid)
|
|||||||
tmp_str = NULL;
|
tmp_str = NULL;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
cleanup:
|
cleanup:
|
||||||
if( NULL != env_data ) {
|
if( NULL != env_data ) {
|
||||||
fclose(env_data);
|
fclose(env_data);
|
||||||
}
|
}
|
||||||
unlink(file_name);
|
unlink(file_name);
|
||||||
|
|
||||||
if( NULL != file_name ){
|
|
||||||
free(file_name);
|
|
||||||
}
|
|
||||||
|
|
||||||
if( NULL != tmp_str ){
|
if( NULL != tmp_str ){
|
||||||
free(tmp_str);
|
free(tmp_str);
|
||||||
}
|
}
|
||||||
@ -871,6 +1042,10 @@ static void* opal_cr_thread_fn(opal_object_t *obj)
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
opal_cr_debug_free_threads[1] = opal_thread_get_self();
|
||||||
|
#endif
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Wait to become active
|
* Wait to become active
|
||||||
*/
|
*/
|
||||||
@ -1106,3 +1281,129 @@ void opal_cr_display_all_timers(void)
|
|||||||
|
|
||||||
opal_output(0, "OPAL CR Timing: ******************** Summary End\n");
|
opal_output(0, "OPAL CR Timing: ******************** Summary End\n");
|
||||||
}
|
}
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
int opal_cr_debug_set_current_ckpt_thread_self(void)
|
||||||
|
{
|
||||||
|
int t;
|
||||||
|
|
||||||
|
if( NULL == opal_cr_debug_free_threads ) {
|
||||||
|
opal_cr_debug_num_free_threads = 3;
|
||||||
|
opal_cr_debug_free_threads = (opal_thread_t **)malloc(sizeof(opal_thread_t *) * opal_cr_debug_num_free_threads );
|
||||||
|
for(t = 0; t < opal_cr_debug_num_free_threads; ++t ) {
|
||||||
|
opal_cr_debug_free_threads[t] = NULL;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
opal_cr_debug_free_threads[0] = opal_thread_get_self();
|
||||||
|
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int opal_cr_debug_clear_current_ckpt_thread(void)
|
||||||
|
{
|
||||||
|
opal_cr_debug_free_threads[0] = NULL;
|
||||||
|
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int MPIR_checkpoint_debugger_detach(void) {
|
||||||
|
/* This function is meant to be a noop function for checkpoint/restart
|
||||||
|
* enabled debugging functionality */
|
||||||
|
#if 0
|
||||||
|
/* Once the debugger can successfully force threads into the function below,
|
||||||
|
* then we can uncomment this line */
|
||||||
|
if( MPIR_debug_with_checkpoint ) {
|
||||||
|
opal_cr_debug_threads_already_waiting = true;
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
void MPIR_checkpoint_debugger_signal_handler(int signo)
|
||||||
|
{
|
||||||
|
opal_output_verbose(1, opal_cr_output,
|
||||||
|
"crs: MPIR_checkpoint_debugger_signal_handler(): Enter Debug signal handler...");
|
||||||
|
|
||||||
|
MPIR_checkpoint_debugger_waitpoint();
|
||||||
|
|
||||||
|
opal_output_verbose(1, opal_cr_output,
|
||||||
|
"crs: MPIR_checkpoint_debugger_signal_handler(): Leave Debug signal handler...");
|
||||||
|
}
|
||||||
|
|
||||||
|
void *MPIR_checkpoint_debugger_waitpoint(void)
|
||||||
|
{
|
||||||
|
int t;
|
||||||
|
opal_thread_t *thr = NULL;
|
||||||
|
|
||||||
|
thr = opal_thread_get_self();
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Sanity check, if the debugger is not going to attach, then do not wait
|
||||||
|
* Make sure to open the debug gate, so that threads can get out
|
||||||
|
*/
|
||||||
|
if( !MPIR_debug_with_checkpoint ) {
|
||||||
|
opal_output_verbose(1, opal_cr_output,
|
||||||
|
"crs: MPIR_checkpoint_debugger_waitpoint(): Debugger is not attaching... (%d)",
|
||||||
|
(int)thr->t_handle);
|
||||||
|
MPIR_checkpoint_debug_gate = 1;
|
||||||
|
return NULL;
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
opal_output_verbose(1, opal_cr_output,
|
||||||
|
"crs: MPIR_checkpoint_debugger_waitpoint(): Waiting for the Debugger to attach... (%d)",
|
||||||
|
(int)thr->t_handle);
|
||||||
|
MPIR_checkpoint_debug_gate = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Let special threads escape without waiting, they will wait later
|
||||||
|
*/
|
||||||
|
for(t = 0; t < opal_cr_debug_num_free_threads; ++t) {
|
||||||
|
if( opal_cr_debug_free_threads[t] != NULL &&
|
||||||
|
opal_thread_self_compare(opal_cr_debug_free_threads[t]) ) {
|
||||||
|
opal_output_verbose(1, opal_cr_output,
|
||||||
|
"crs: MPIR_checkpoint_debugger_waitpoint(): Checkpointing thread does not wait here... (%d)",
|
||||||
|
(int)thr->t_handle);
|
||||||
|
return NULL;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Force all other threads into the waiting function,
|
||||||
|
* unless they are already in there, then just return so we do not nest
|
||||||
|
* calls into this wait function and potentially confuse the debugger.
|
||||||
|
*/
|
||||||
|
if( opal_cr_debug_threads_already_waiting ) {
|
||||||
|
opal_output_verbose(1, opal_cr_output,
|
||||||
|
"crs: MPIR_checkpoint_debugger_waitpoint(): Threads are already waiting from debugger detach, do not wait here... (%d)",
|
||||||
|
(int)thr->t_handle);
|
||||||
|
return NULL;
|
||||||
|
} else {
|
||||||
|
opal_output_verbose(1, opal_cr_output,
|
||||||
|
"crs: MPIR_checkpoint_debugger_waitpoint(): Wait... (%d)",
|
||||||
|
(int)thr->t_handle);
|
||||||
|
return MPIR_checkpoint_debugger_breakpoint();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* A tight loop to wait for debugger to release this process from the
|
||||||
|
* breakpoint.
|
||||||
|
*/
|
||||||
|
void *MPIR_checkpoint_debugger_breakpoint(void)
|
||||||
|
{
|
||||||
|
/* spin until debugger attaches and releases us */
|
||||||
|
while (MPIR_checkpoint_debug_gate == 0) {
|
||||||
|
#if defined(__WINDOWS__)
|
||||||
|
Sleep(100); /* milliseconds */
|
||||||
|
#elif defined(HAVE_USLEEP)
|
||||||
|
usleep(100000); /* microseconds */
|
||||||
|
#else
|
||||||
|
sleep(1); /* seconds */
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
opal_cr_debug_threads_already_waiting = false;
|
||||||
|
return NULL;
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
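Note on opal_cr_refresh_environ() above: it checks that the environment file left by opal-restart still exists, loads it, and relies on extract_env_vars() to unlink it so a second call is a no-op. The parsing hunk is not shown in this diff, so the following is only a minimal sketch of that pattern, assuming one NAME=VALUE pair per line. `demo_refresh_environ` and the file layout are hypothetical, not the OPAL implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of OPAL): load NAME=VALUE lines from
 * `path` into the process environment, then delete the file.
 * Returns the number of variables set, 0 if the file does not exist,
 * or -1 if the file exists but cannot be opened.
 */
int demo_refresh_environ(const char *path)
{
    struct stat st;
    char line[4096];
    int count = 0;
    FILE *fp = NULL;

    assert(NULL != path);

    /* A missing file means an earlier call already consumed it. */
    if (0 != stat(path, &st)) {
        return 0;
    }

    if (NULL == (fp = fopen(path, "r"))) {
        return -1;
    }

    while (NULL != fgets(line, sizeof(line), fp)) {
        char *eq = NULL;

        line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */
        eq = strchr(line, '=');
        if (NULL == eq || eq == line) {
            continue;                       /* skip lines without NAME= */
        }
        *eq = '\0';                         /* split into name and value */
        if (0 == setenv(line, eq + 1, 1)) {
            ++count;
        }
    }

    fclose(fp);
    unlink(path);   /* consume the file so the next call returns early */

    return count;
}
```

Calling demo_refresh_environ() twice on the same path sets the variables on the first call and returns 0 on the second, which mirrors the "file has been deleted on the previous round" comment in the diff.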
@ -91,6 +91,44 @@ typedef enum opal_cr_ckpt_cmd_state_t opal_cr_ckpt_cmd_state_t;
|
|||||||
/* The current state of a checkpoint operation */
|
/* The current state of a checkpoint operation */
|
||||||
OPAL_DECLSPEC extern int opal_cr_checkpointing_state;
|
OPAL_DECLSPEC extern int opal_cr_checkpointing_state;
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_CRDEBUG == 1
|
||||||
|
/* Whether or not C/R Debugging is enabled for this process */
|
||||||
|
OPAL_DECLSPEC extern int MPIR_debug_with_checkpoint;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Set/clear the current thread id for the checkpointing thread
|
||||||
|
*/
|
||||||
|
OPAL_DECLSPEC int opal_cr_debug_set_current_ckpt_thread_self(void);
|
||||||
|
OPAL_DECLSPEC int opal_cr_debug_clear_current_ckpt_thread(void);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* This MPI Debugger function needs to be accessed here and have a specific
|
||||||
|
* name. Thus we are breaking the traditional naming conventions to provide this functionality.
|
||||||
|
*/
|
||||||
|
OPAL_DECLSPEC int MPIR_checkpoint_debugger_detach(void);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* A tight loop to wait for debugger to release this process from the
|
||||||
|
* breakpoint.
|
||||||
|
*/
|
||||||
|
OPAL_DECLSPEC void *MPIR_checkpoint_debugger_breakpoint(void);
|
||||||
|
|
||||||
|
/**
|
||||||
|
 * A function for the debugger or CRS to force all threads into the wait function
|
||||||
|
*/
|
||||||
|
OPAL_DECLSPEC void *MPIR_checkpoint_debugger_waitpoint(void);
|
||||||
|
|
||||||
|
/**
|
||||||
|
* A signal handler to force all threads to wait when debugger detaches
|
||||||
|
*/
|
||||||
|
OPAL_DECLSPEC void MPIR_checkpoint_debugger_signal_handler(int signo);
|
||||||
|
#endif
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Refresh environment variables after a restart
|
||||||
|
*/
|
||||||
|
OPAL_DECLSPEC int opal_cr_refresh_environ(int prev_pid);
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* If this is an application that doesn't want to have
|
* If this is an application that doesn't want to have
|
||||||
* a notification callback installed, set this to false.
|
* a notification callback installed, set this to false.
|
||||||
@ -253,6 +291,42 @@ typedef enum opal_cr_ckpt_cmd_state_t opal_cr_ckpt_cmd_state_t;
|
|||||||
int *state);
|
int *state);
|
||||||
OPAL_DECLSPEC int opal_cr_inc_core_recover(int state);
|
OPAL_DECLSPEC int opal_cr_inc_core_recover(int state);
|
||||||
|
|
||||||
|
|
||||||
|
/*******************************
|
||||||
|
* User Coordination Routines
|
||||||
|
*******************************/
|
||||||
|
typedef enum {
|
||||||
|
OMPI_CR_INC_PRE_CRS_PRE_MPI = 0,
|
||||||
|
OMPI_CR_INC_PRE_CRS_POST_MPI = 1,
|
||||||
|
OMPI_CR_INC_CRS_PRE_CKPT = 2,
|
||||||
|
OMPI_CR_INC_CRS_POST_CKPT = 3,
|
||||||
|
OMPI_CR_INC_POST_CRS_PRE_MPI = 4,
|
||||||
|
OMPI_CR_INC_POST_CRS_POST_MPI = 5,
|
||||||
|
OMPI_CR_INC_MAX = 6
|
||||||
|
} opal_cr_user_inc_callback_event_t;
|
||||||
|
|
||||||
|
typedef enum {
|
||||||
|
OMPI_CR_INC_STATE_PREPARE = 0,
|
||||||
|
OMPI_CR_INC_STATE_CONTINUE = 1,
|
||||||
|
OMPI_CR_INC_STATE_RESTART = 2,
|
||||||
|
OMPI_CR_INC_STATE_ERROR = 3
|
||||||
|
} opal_cr_user_inc_callback_state_t;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* User coordination callback routine
|
||||||
|
*/
|
||||||
|
typedef int (*opal_cr_user_inc_callback_fn_t)(opal_cr_user_inc_callback_event_t event,
|
||||||
|
opal_cr_user_inc_callback_state_t state);
|
||||||
|
|
||||||
|
OPAL_DECLSPEC int opal_cr_user_inc_register_callback
|
||||||
|
(opal_cr_user_inc_callback_event_t event,
|
||||||
|
opal_cr_user_inc_callback_fn_t function,
|
||||||
|
opal_cr_user_inc_callback_fn_t *prev_function);
|
||||||
|
|
||||||
|
OPAL_DECLSPEC int trigger_user_inc_callback(opal_cr_user_inc_callback_event_t event,
|
||||||
|
opal_cr_user_inc_callback_state_t state);
|
||||||
|
|
||||||
|
|
||||||
/*******************************
|
/*******************************
|
||||||
* Coordination Routines
|
* Coordination Routines
|
||||||
*******************************/
|
*******************************/
|
||||||
|
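Note on the user coordination routines declared above: each event slot holds at most one callback, registration hands back the previous callback, and triggering an empty slot is treated as success. The following is a minimal standalone sketch of that registry pattern. All `demo_*` / `DEMO_*` names are hypothetical stand-ins for the opal_cr_user_inc_* types and OMPI_CR_INC_* values, and the sketch validates the event index before it reads the table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the event and state enums in opal_cr.h */
typedef enum {
    DEMO_INC_PRE_CRS_PRE_MPI  = 0,
    DEMO_INC_PRE_CRS_POST_MPI = 1,
    DEMO_INC_MAX              = 2
} demo_event_t;

typedef enum {
    DEMO_STATE_PREPARE  = 0,
    DEMO_STATE_CONTINUE = 1
} demo_state_t;

typedef int (*demo_callback_fn_t)(demo_event_t event, demo_state_t state);

#define DEMO_SUCCESS 0
#define DEMO_ERROR   (-1)

/* One callback slot per event; NULL means nothing is registered. */
static demo_callback_fn_t demo_callbacks[DEMO_INC_MAX] = { NULL };

/* Install fn for event and report the callback it replaces (if any). */
int demo_register_callback(demo_event_t event,
                           demo_callback_fn_t fn,
                           demo_callback_fn_t *prev_fn)
{
    if ((int)event < 0 || (int)event >= DEMO_INC_MAX) {
        return DEMO_ERROR;
    }
    if (NULL != prev_fn) {
        *prev_fn = demo_callbacks[event];
    }
    demo_callbacks[event] = fn;
    return DEMO_SUCCESS;
}

/* Invoke the callback for event; an empty slot is not an error. */
int demo_trigger_callback(demo_event_t event, demo_state_t state)
{
    if ((int)event < 0 || (int)event >= DEMO_INC_MAX) {
        return DEMO_ERROR;
    }
    if (NULL == demo_callbacks[event]) {
        return DEMO_SUCCESS;
    }
    return demo_callbacks[event](event, state);
}

/* Example callback: remember the last state it was called with. */
static int demo_last_state = -1;

int demo_record_state(demo_event_t event, demo_state_t state)
{
    (void)event;
    demo_last_state = (int)state;
    return DEMO_SUCCESS;
}

/* Usage: register, trigger, then restore the previous callback. */
int demo_usage(void)
{
    demo_callback_fn_t prev = NULL;
    demo_callback_fn_t ignored = NULL;

    assert(DEMO_SUCCESS == demo_register_callback(DEMO_INC_PRE_CRS_POST_MPI,
                                                  demo_record_state, &prev));
    assert(DEMO_SUCCESS == demo_trigger_callback(DEMO_INC_PRE_CRS_POST_MPI,
                                                 DEMO_STATE_PREPARE));
    return demo_register_callback(DEMO_INC_PRE_CRS_POST_MPI, prev, &ignored);
}
```

Returning the previous callback lets a caller chain to it or restore it later, which is the same convention opal_cr_reg_coord_callback() uses with its prev_func argument.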
@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
* University Research and Technology
|
* University Research and Technology
|
||||||
* Corporation. All rights reserved.
|
* Corporation. All rights reserved.
|
||||||
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||||
@ -43,6 +43,9 @@
|
|||||||
#include "opal/event/event.h"
|
#include "opal/event/event.h"
|
||||||
#include "opal/runtime/opal_progress.h"
|
#include "opal/runtime/opal_progress.h"
|
||||||
#include "opal/mca/carto/base/base.h"
|
#include "opal/mca/carto/base/base.h"
|
||||||
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
|
#include "opal/mca/compress/base/base.h"
|
||||||
|
#endif
|
||||||
|
|
||||||
#include "opal/runtime/opal_cr.h"
|
#include "opal/runtime/opal_cr.h"
|
||||||
#include "opal/mca/crs/base/base.h"
|
#include "opal/mca/crs/base/base.h"
|
||||||
@ -112,6 +115,10 @@ opal_finalize(void)
|
|||||||
/* close the checkpoint and restart service */
|
/* close the checkpoint and restart service */
|
||||||
opal_cr_finalize();
|
opal_cr_finalize();
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
|
opal_compress_base_close();
|
||||||
|
#endif
|
||||||
|
|
||||||
opal_progress_finalize();
|
opal_progress_finalize();
|
||||||
|
|
||||||
opal_event_fini();
|
opal_event_fini();
|
||||||
|
@ -1,5 +1,5 @@
|
|||||||
/*
|
/*
|
||||||
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
* University Research and Technology
|
* University Research and Technology
|
||||||
* Corporation. All rights reserved.
|
* Corporation. All rights reserved.
|
||||||
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||||
@ -40,6 +40,9 @@
|
|||||||
#include "opal/mca/memchecker/base/base.h"
|
#include "opal/mca/memchecker/base/base.h"
|
||||||
#include "opal/dss/dss.h"
|
#include "opal/dss/dss.h"
|
||||||
#include "opal/mca/carto/base/base.h"
|
#include "opal/mca/carto/base/base.h"
|
||||||
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
|
#include "opal/mca/compress/base/base.h"
|
||||||
|
#endif
|
||||||
|
|
||||||
#include "opal/runtime/opal_cr.h"
|
#include "opal/runtime/opal_cr.h"
|
||||||
#include "opal/mca/crs/base/base.h"
|
#include "opal/mca/crs/base/base.h"
|
||||||
@ -425,6 +428,23 @@ opal_init(int* pargc, char*** pargv)
|
|||||||
/* we want to tick the event library whenever possible */
|
/* we want to tick the event library whenever possible */
|
||||||
opal_progress_event_users_increment();
|
opal_progress_event_users_increment();
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_FT_CR == 1
|
||||||
|
/*
|
||||||
|
* Initialize the compression framework
|
||||||
|
* Note: Currently only used in C/R so it has been marked to only
|
||||||
|
* initialize when C/R is enabled. If other places in the code
|
||||||
|
* wish to use this framework, it is safe to remove the protection.
|
||||||
|
*/
|
||||||
|
if( OPAL_SUCCESS != (ret = opal_compress_base_open()) ) {
|
||||||
|
error = "opal_compress_base_open() failed";
|
||||||
|
goto return_error;
|
||||||
|
}
|
||||||
|
if( OPAL_SUCCESS != (ret = opal_compress_base_select()) ) {
|
||||||
|
error = "opal_compress_base_select() failed";
|
||||||
|
goto return_error;
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
/*
|
/*
|
||||||
 * Initialize the checkpoint/restart functionality
|
 * Initialize the checkpoint/restart functionality
|
||||||
* Note: Always do this so we can detect if the user
|
* Note: Always do this so we can detect if the user
|
||||||
|
@ -21,7 +21,7 @@
|
|||||||
# This is the US/English help file for Open MPI checkpoint tool
|
# This is the US/English help file for Open MPI checkpoint tool
|
||||||
#
|
#
|
||||||
[usage]
|
[usage]
|
||||||
opal-restart FILENAME
|
opal-restart -r FILENAME
|
||||||
Open PAL Single Process Restart Tool
|
Open PAL Single Process Restart Tool
|
||||||
|
|
||||||
%s
|
%s
|
||||||
@ -70,3 +70,10 @@ Error: The restart command failed to properly exec the process per
|
|||||||
|
|
||||||
Expected Component: %s
|
Expected Component: %s
|
||||||
Selected Component: %s
|
Selected Component: %s
|
||||||
|
|
||||||
|
[cache_not_avail]
|
||||||
|
Warning: Recommended cache directory could not be accessed. Falling back
|
||||||
|
to the snapshot location.
|
||||||
|
Cache Dir : %s
|
||||||
|
Snapshot Dir: %s
|
||||||
|
|
||||||
|
@ -61,6 +61,7 @@
|
|||||||
#include "opal/util/show_help.h"
|
#include "opal/util/show_help.h"
|
||||||
#include "opal/util/output.h"
|
#include "opal/util/output.h"
|
||||||
#include "opal/util/opal_environ.h"
|
#include "opal/util/opal_environ.h"
|
||||||
|
#include "opal/util/basename.h"
|
||||||
#include "opal/mca/base/base.h"
|
#include "opal/mca/base/base.h"
|
||||||
#include "opal/mca/base/mca_base_param.h"
|
#include "opal/mca/base/mca_base_param.h"
|
||||||
|
|
||||||
@ -70,14 +71,17 @@
|
|||||||
#include "opal/mca/crs/crs.h"
|
#include "opal/mca/crs/crs.h"
|
||||||
#include "opal/mca/crs/base/base.h"
|
#include "opal/mca/crs/base/base.h"
|
||||||
|
|
||||||
|
#include "opal/mca/compress/compress.h"
|
||||||
|
#include "opal/mca/compress/base/base.h"
|
||||||
|
|
||||||
/******************
|
/******************
|
||||||
* Local Functions
|
* Local Functions
|
||||||
******************/
|
******************/
|
||||||
static int initialize(int argc, char *argv[]);
|
static int initialize(int argc, char *argv[]);
|
||||||
static int finalize(void);
|
static int finalize(void);
|
||||||
static int parse_args(int argc, char *argv[]);
|
static int parse_args(int argc, char *argv[]);
|
||||||
static int check_file(char *given_filename);
|
static int check_file(void);
|
||||||
static int post_env_vars(int prev_pid, char *location);
|
static int post_env_vars(int prev_pid, opal_crs_base_snapshot_t *snapshot);
|
||||||
|
|
||||||
/*****************************************
|
/*****************************************
|
||||||
* Global Vars for Command line Arguments
|
* Global Vars for Command line Arguments
|
||||||
@ -86,10 +90,13 @@ static char *expected_crs_comp = NULL;
|
|||||||
|
|
||||||
typedef struct {
|
typedef struct {
|
||||||
bool help;
|
bool help;
|
||||||
char *filename;
|
|
||||||
bool verbose;
|
bool verbose;
|
||||||
bool forked;
|
char *snapshot_ref;
|
||||||
char *snapshot_loc;
|
char *snapshot_loc;
|
||||||
|
char *snapshot_metadata;
|
||||||
|
char *snapshot_cache;
|
||||||
|
char *snapshot_compress;
|
||||||
|
char *snapshot_compress_postfix;
|
||||||
int output;
|
int output;
|
||||||
} opal_restart_globals_t;
|
} opal_restart_globals_t;
|
||||||
|
|
||||||
@ -109,19 +116,40 @@ opal_cmd_line_init_t cmd_line_opts[] = {
|
|||||||
"Be Verbose" },
|
"Be Verbose" },
|
||||||
|
|
||||||
{ NULL, NULL, NULL,
|
{ NULL, NULL, NULL,
|
||||||
'\0', NULL, "fork",
|
'l', NULL, "location",
|
||||||
0,
|
|
||||||
&opal_restart_globals.forked, OPAL_CMD_LINE_TYPE_BOOL,
|
|
||||||
"Fork off a new process which is the restarted process instead of "
|
|
||||||
"replacing opal_restart" },
|
|
||||||
|
|
||||||
{ "crs", "base", "snapshot_dir",
|
|
||||||
'w', NULL, "where",
|
|
||||||
1,
|
1,
|
||||||
&opal_restart_globals.snapshot_loc, OPAL_CMD_LINE_TYPE_STRING,
|
&opal_restart_globals.snapshot_loc, OPAL_CMD_LINE_TYPE_STRING,
|
||||||
"Where to find the checkpoint files. In most cases this is automatically "
|
"Full path to the location of the local snapshot."},
|
||||||
"detected, however if a custom location was specified to opal-checkpoint "
|
|
||||||
"then this argument is meant to match it."},
|
{ NULL, NULL, NULL,
|
||||||
|
'm', NULL, "metadata",
|
||||||
|
1,
|
||||||
|
&opal_restart_globals.snapshot_metadata, OPAL_CMD_LINE_TYPE_STRING,
|
||||||
|
"Relative path (with respect to --location) to the metadata file."},
|
||||||
|
|
||||||
|
{ NULL, NULL, NULL,
|
||||||
|
'r', NULL, "reference",
|
||||||
|
1,
|
||||||
|
&opal_restart_globals.snapshot_ref, OPAL_CMD_LINE_TYPE_STRING,
|
||||||
|
"Local snapshot reference."},
|
||||||
|
|
||||||
|
{ NULL, NULL, NULL,
|
||||||
|
'c', NULL, "cache",
|
||||||
|
1,
|
||||||
|
&opal_restart_globals.snapshot_cache, OPAL_CMD_LINE_TYPE_STRING,
|
||||||
|
"Possible local cache of the snapshot reference."},
|
||||||
|
|
||||||
|
{ NULL, NULL, NULL,
|
||||||
|
'd', NULL, "decompress",
|
||||||
|
1,
|
||||||
|
&opal_restart_globals.snapshot_compress, OPAL_CMD_LINE_TYPE_STRING,
|
||||||
|
"Decompression component to use."},
|
||||||
|
|
||||||
|
{ NULL, NULL, NULL,
|
||||||
|
'p', NULL, "decompress_postfix",
|
||||||
|
1,
|
||||||
|
&opal_restart_globals.snapshot_compress_postfix, OPAL_CMD_LINE_TYPE_STRING,
|
||||||
|
"Decompression component postfix."},
|
||||||
|
|
||||||
/* End of list */
|
/* End of list */
|
||||||
{ NULL, NULL, NULL,
|
{ NULL, NULL, NULL,
|
||||||
@ -151,9 +179,9 @@ main(int argc, char *argv[])
|
|||||||
/*
|
/*
|
||||||
* Check for existence of the file, or program in the case of self
|
* Check for existence of the file, or program in the case of self
|
||||||
*/
|
*/
|
||||||
if( OPAL_SUCCESS != (ret = check_file(opal_restart_globals.filename) )) {
|
if( OPAL_SUCCESS != (ret = check_file() )) {
|
||||||
opal_show_help("help-opal-restart.txt", "invalid_filename", true,
|
opal_show_help("help-opal-restart.txt", "invalid_filename", true,
|
||||||
opal_restart_globals.filename);
|
opal_restart_globals.snapshot_ref);
|
||||||
exit_status = ret;
|
exit_status = ret;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
@ -170,19 +198,35 @@ main(int argc, char *argv[])
|
|||||||
* Make sure we are using the correct checkpointer
|
* Make sure we are using the correct checkpointer
|
||||||
*/
|
*/
|
||||||
if(NULL == expected_crs_comp) {
|
if(NULL == expected_crs_comp) {
|
||||||
char * base = NULL;
|
char * full_metadata_path = NULL;
|
||||||
|
FILE * metadata = NULL;
|
||||||
|
|
||||||
base = opal_crs_base_get_snapshot_directory(opal_restart_globals.filename);
|
asprintf(&full_metadata_path, "%s/%s/%s",
|
||||||
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(base,
|
opal_restart_globals.snapshot_loc,
|
||||||
|
opal_restart_globals.snapshot_ref,
|
||||||
|
opal_restart_globals.snapshot_metadata);
|
||||||
|
if( NULL == (metadata = fopen(full_metadata_path, "r")) ) {
|
||||||
|
opal_show_help("help-opal-restart.txt", "invalid_metadata", true,
|
||||||
|
opal_restart_globals.snapshot_metadata,
|
||||||
|
full_metadata_path);
|
||||||
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(metadata,
|
||||||
&expected_crs_comp,
|
&expected_crs_comp,
|
||||||
&prev_pid)) ) {
|
&prev_pid)) ) {
|
||||||
opal_show_help("help-opal-restart.txt", "invalid_metadata", true,
|
opal_show_help("help-opal-restart.txt", "invalid_metadata", true,
|
||||||
opal_crs_base_metadata_filename, base);
|
opal_restart_globals.snapshot_metadata,
|
||||||
|
full_metadata_path);
|
||||||
exit_status = ret;
|
exit_status = ret;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
free(base);
|
free(full_metadata_path);
|
||||||
|
full_metadata_path = NULL;
|
||||||
|
|
||||||
|
fclose(metadata);
|
||||||
|
metadata = NULL;
|
||||||
}
|
}
|
||||||
|
|
||||||
opal_output_verbose(10, opal_restart_globals.output,
|
opal_output_verbose(10, opal_restart_globals.output,
|
||||||
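The new code in this hunk locates the metadata file by joining three command line values: `--location`, `--reference` and `--metadata`. A rough, standalone sketch of that join is below. It assumes nothing about OPAL; `sketch_metadata_path` is a hypothetical helper, and it uses two `snprintf()` passes rather than the non-standard `asprintf()` so that an allocation failure can be reported.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Join the pieces as "<loc>/<ref>/<meta>".
 * Returns a malloc'd string that the caller must free, or NULL if the
 * reference is empty or memory could not be allocated.
 */
char *sketch_metadata_path(const char *loc, const char *ref, const char *meta)
{
    int len;
    char *path;

    assert(NULL != loc && NULL != ref && NULL != meta);

    /* An empty reference is rejected, as parse_args() does in the diff. */
    if (0 == strlen(ref)) {
        return NULL;
    }

    /* First pass measures the result, second pass writes it. */
    len = snprintf(NULL, 0, "%s/%s/%s", loc, ref, meta);
    if (len < 0) {
        return NULL;
    }
    path = malloc((size_t)len + 1);
    if (NULL == path) {
        return NULL;
    }
    snprintf(path, (size_t)len + 1, "%s/%s/%s", loc, ref, meta);
    return path;
}
```

For example, a location of `/tmp/ckpt`, a reference of `snap_0` and a metadata name of `meta.data` (all made-up values) would give `/tmp/ckpt/snap_0/meta.data`.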
@ -235,21 +279,17 @@ main(int argc, char *argv[])
|
|||||||
* Restart in this process
|
* Restart in this process
|
||||||
******************************/
|
******************************/
|
||||||
opal_output_verbose(10, opal_restart_globals.output,
|
opal_output_verbose(10, opal_restart_globals.output,
|
||||||
"Restarting from file (%s)",
|
"Restarting from file (%s)\n",
|
||||||
opal_restart_globals.filename);
|
opal_restart_globals.snapshot_ref);
|
||||||
if( opal_restart_globals.forked ) {
|
|
||||||
opal_output_verbose(10, opal_restart_globals.output,
|
|
||||||
"\t Forking off a child");
|
|
||||||
} else {
|
|
||||||
opal_output_verbose(10, opal_restart_globals.output,
|
|
||||||
"\t Exec in self");
|
|
||||||
}
|
|
||||||
|
|
||||||
snapshot = OBJ_NEW(opal_crs_base_snapshot_t);
|
snapshot = OBJ_NEW(opal_crs_base_snapshot_t);
|
||||||
snapshot->cold_start = true;
|
snapshot->cold_start = true;
|
||||||
snapshot->reference_name = strdup(opal_restart_globals.filename);
|
asprintf(&(snapshot->snapshot_directory), "%s/%s",
|
||||||
snapshot->local_location = opal_crs_base_get_snapshot_directory(snapshot->reference_name);
|
opal_restart_globals.snapshot_loc,
|
||||||
snapshot->remote_location = strdup(snapshot->local_location);
|
opal_restart_globals.snapshot_ref);
|
||||||
|
asprintf(&(snapshot->metadata_filename), "%s/%s",
|
||||||
|
snapshot->snapshot_directory,
|
||||||
|
opal_restart_globals.snapshot_metadata);
|
||||||
|
|
||||||
/* Since some checkpoint/restart systems don't pass along env vars to the
|
/* Since some checkpoint/restart systems don't pass along env vars to the
|
||||||
* restarted app, we need to take care of that.
|
* restarted app, we need to take care of that.
|
||||||
@ -257,7 +297,7 @@ main(int argc, char *argv[])
|
|||||||
* Included here is the creation of any files or directories that need to be
|
* Included here is the creation of any files or directories that need to be
|
||||||
* created before the process is restarted.
|
* created before the process is restarted.
|
||||||
*/
|
*/
|
||||||
if(OPAL_SUCCESS != (ret = post_env_vars(prev_pid, snapshot->local_location) ) ) {
|
if(OPAL_SUCCESS != (ret = post_env_vars(prev_pid, snapshot) ) ) {
|
||||||
exit_status = ret;
|
exit_status = ret;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
@ -266,27 +306,16 @@ main(int argc, char *argv[])
|
|||||||
* Do the actual restart
|
* Do the actual restart
|
||||||
*/
|
*/
|
||||||
ret = opal_crs.crs_restart(snapshot,
|
ret = opal_crs.crs_restart(snapshot,
|
||||||
opal_restart_globals.forked,
|
false,
|
||||||
&child_pid);
|
&child_pid);
|
||||||
|
|
||||||
if (OPAL_SUCCESS != ret) {
|
if (OPAL_SUCCESS != ret) {
|
||||||
opal_show_help("help-opal-restart.txt", "restart_cmd_failure", true,
|
opal_show_help("help-opal-restart.txt", "restart_cmd_failure", true,
|
||||||
opal_restart_globals.filename, ret);
|
opal_restart_globals.snapshot_ref, ret);
|
||||||
exit_status = ret;
|
exit_status = ret;
|
||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
/* Should never get here, since crs_restart calls exec */
|
||||||
/* If we required it to exec in self, then fail if this function returns. */
|
|
||||||
if(!opal_restart_globals.forked) {
|
|
||||||
opal_show_help("help-opal-restart.txt", "failed-to-exec", true,
|
|
||||||
expected_crs_comp,
|
|
||||||
opal_crs_base_selected_component.base_version.mca_component_name);
|
|
||||||
exit_status = ret;
|
|
||||||
goto cleanup;
|
|
||||||
}
|
|
||||||
|
|
||||||
opal_output_verbose(10, opal_restart_globals.output,
|
|
||||||
"opal_restart: Restarted Child with PID = %d\n", child_pid);
|
|
||||||
|
|
||||||
/***************
|
/***************
|
||||||
* Cleanup
|
* Cleanup
|
||||||
@ -320,8 +349,8 @@ static int initialize(int argc, char *argv[])
|
|||||||
* Parse Command line arguments
|
* Parse Command line arguments
|
||||||
*/
|
*/
|
||||||
if (OPAL_SUCCESS != (ret = parse_args(argc, argv))) {
|
if (OPAL_SUCCESS != (ret = parse_args(argc, argv))) {
|
||||||
goto cleanup;
|
|
||||||
exit_status = ret;
|
exit_status = ret;
|
||||||
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
@ -345,6 +374,18 @@ static int initialize(int argc, char *argv[])
|
|||||||
free(tmp_env_var);
|
free(tmp_env_var);
|
||||||
tmp_env_var = NULL;
|
tmp_env_var = NULL;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Make sure we select the proper compress component.
|
||||||
|
*/
|
||||||
|
if( NULL != opal_restart_globals.snapshot_compress ) {
|
||||||
|
tmp_env_var = mca_base_param_env_var("compress");
|
||||||
|
opal_setenv(tmp_env_var,
|
||||||
|
opal_restart_globals.snapshot_compress,
|
||||||
|
true, &environ);
|
||||||
|
free(tmp_env_var);
|
||||||
|
tmp_env_var = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Initialize the OPAL layer
|
* Initialize the OPAL layer
|
||||||
*/
|
*/
|
||||||
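The hunk above forces the compress component by exporting an MCA parameter into the environment before the OPAL layer is initialized; the following hunk removes it again before the restart. The sketch below shows the same set/clear idea with plain POSIX calls. The `OMPI_MCA_` prefix is an assumption about what `mca_base_param_env_var()` produces, the `sketch_*` helpers are hypothetical, and `gzip` in the test is only an example value.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed prefix for MCA parameters passed through the environment. */
#define SKETCH_MCA_PREFIX "OMPI_MCA_"

/* Build "<prefix><framework>" into buf; returns 0 on success, -1 if it does not fit. */
static int sketch_param_name(const char *framework, char *buf, size_t size)
{
    int len = snprintf(buf, size, "%s%s", SKETCH_MCA_PREFIX, framework);
    return (len < 0 || (size_t)len >= size) ? -1 : 0;
}

/* Export the selection so that a later framework "select" step would see it. */
int sketch_select_component(const char *framework, const char *component)
{
    char name[128];

    assert(NULL != framework && NULL != component);
    if (0 == strlen(component) ||
        0 != sketch_param_name(framework, name, sizeof(name))) {
        return -1;
    }
    return setenv(name, component, 1); /* 1: overwrite an existing value */
}

/* Remove the selection again, so a restarted process can choose freely. */
int sketch_clear_selection(const char *framework)
{
    char name[128];

    assert(NULL != framework);
    if (0 != sketch_param_name(framework, name, sizeof(name))) {
        return -1;
    }
    return unsetenv(name);
}
```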
@ -353,6 +394,72 @@ static int initialize(int argc, char *argv[])
|
|||||||
goto cleanup;
|
goto cleanup;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* If the checkpoint was compressed, then decompress it before continuing
|
||||||
|
*/
|
||||||
|
if( NULL != opal_restart_globals.snapshot_compress ) {
|
||||||
|
char * zip_dir = NULL;
|
||||||
|
char * tmp_str = NULL;
|
||||||
|
|
||||||
|
/* Make sure to clear the selection for the restart,
|
||||||
|
* this way the user can switch compression mechanism
|
||||||
|
* across restart
|
||||||
|
*/
|
||||||
|
tmp_env_var = mca_base_param_env_var("compress");
|
||||||
|
opal_unsetenv(tmp_env_var, &environ);
|
||||||
|
free(tmp_env_var);
|
||||||
|
tmp_env_var = NULL;
|
||||||
|
|
||||||
|
asprintf(&zip_dir, "%s/%s%s",
|
||||||
|
opal_restart_globals.snapshot_loc,
|
||||||
|
opal_restart_globals.snapshot_ref,
|
||||||
|
opal_restart_globals.snapshot_compress_postfix);
|
||||||
|
|
||||||
|
if (0 > (ret = access(zip_dir, F_OK)) ) {
|
||||||
|
opal_output(opal_restart_globals.output,
|
||||||
|
"Error: Unable to access the file [%s]!",
|
||||||
|
zip_dir);
|
||||||
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
opal_output_verbose(10, opal_restart_globals.output,
|
||||||
|
"Decompressing (%s)",
|
||||||
|
zip_dir);
|
||||||
|
|
||||||
|
opal_compress.decompress(zip_dir, &tmp_str);
|
||||||
|
|
||||||
|
if( NULL != zip_dir ) {
|
||||||
|
free(zip_dir);
|
||||||
|
zip_dir = NULL;
|
||||||
|
}
|
||||||
|
if( NULL != tmp_str ) {
|
||||||
|
free(tmp_str);
|
||||||
|
tmp_str = NULL;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* If a cache directory has been suggested, see if it exists
|
||||||
|
*/
|
||||||
|
if( NULL != opal_restart_globals.snapshot_cache ) {
|
||||||
|
if(0 == (ret = access(opal_restart_globals.snapshot_cache, F_OK)) ) {
|
||||||
|
opal_output_verbose(10, opal_restart_globals.output,
|
||||||
|
"Using the cached snapshot (%s) instead of (%s)",
|
||||||
|
opal_restart_globals.snapshot_cache,
|
||||||
|
opal_restart_globals.snapshot_loc);
|
||||||
|
if( NULL != opal_restart_globals.snapshot_loc ) {
|
||||||
|
free(opal_restart_globals.snapshot_loc);
|
||||||
|
opal_restart_globals.snapshot_loc = NULL;
|
||||||
|
}
|
||||||
|
opal_restart_globals.snapshot_loc = opal_dirname(opal_restart_globals.snapshot_cache);
|
||||||
|
} else {
|
||||||
|
opal_show_help("help-opal-restart.txt", "cache_not_avail", true,
|
||||||
|
opal_restart_globals.snapshot_cache,
|
||||||
|
opal_restart_globals.snapshot_loc);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Mark this process as a tool
|
* Mark this process as a tool
|
||||||
*/
|
*/
|
||||||
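The cache handling in this hunk prefers a suggested cache directory when it is accessible and otherwise keeps the original snapshot location (emitting the `cache_not_avail` warning). A simplified sketch of that decision is below. The helpers are hypothetical, and `sketch_parent_dir()` is a deliberately small stand-in for `opal_dirname()`: it does not handle trailing slashes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Return a malloc'd copy of the parent directory of path ("." if it has no slash). */
static char *sketch_parent_dir(const char *path)
{
    const char *slash = strrchr(path, '/');
    size_t len;
    char *out;

    if (NULL == slash) {
        return strdup(".");
    }
    /* Keep the leading "/" when the only slash is the first character. */
    len = (slash == path) ? 1 : (size_t)(slash - path);
    out = malloc(len + 1);
    if (NULL != out) {
        memcpy(out, path, len);
        out[len] = '\0';
    }
    return out;
}

/*
 * Pick the directory to restart from: the parent of the cached snapshot
 * if the cache exists, otherwise the fallback location.
 * Returns a malloc'd string that the caller must free.
 */
char *sketch_pick_location(const char *cache, const char *fallback)
{
    assert(NULL != fallback);

    if (NULL != cache && 0 == access(cache, F_OK)) {
        return sketch_parent_dir(cache);
    }
    return strdup(fallback);
}
```

Returning a fresh copy in both branches lets the caller free the result unconditionally, which mirrors how the diff frees `snapshot_loc` before replacing it.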
@ -380,10 +487,13 @@ static int parse_args(int argc, char *argv[])
|
|||||||
char **app_env = NULL, **global_env = NULL;
|
char **app_env = NULL, **global_env = NULL;
|
||||||
|
|
||||||
opal_restart_globals.help = false;
|
opal_restart_globals.help = false;
|
||||||
opal_restart_globals.filename = NULL;
|
|
||||||
opal_restart_globals.verbose = false;
|
opal_restart_globals.verbose = false;
|
||||||
opal_restart_globals.forked = false;
|
opal_restart_globals.snapshot_ref = NULL;
|
||||||
opal_restart_globals.snapshot_loc = NULL;
|
opal_restart_globals.snapshot_loc = NULL;
|
||||||
|
opal_restart_globals.snapshot_metadata = NULL;
|
||||||
|
opal_restart_globals.snapshot_cache = NULL;
|
||||||
|
opal_restart_globals.snapshot_compress = NULL;
|
||||||
|
opal_restart_globals.snapshot_compress_postfix = NULL;
|
||||||
opal_restart_globals.output = 0;
|
opal_restart_globals.output = 0;
|
||||||
|
|
||||||
/* Parse the command line options */
|
/* Parse the command line options */
|
||||||
@ -412,8 +522,7 @@ static int parse_args(int argc, char *argv[])
|
|||||||
* Now start parsing our specific arguments
|
* Now start parsing our specific arguments
|
||||||
*/
|
*/
|
||||||
if (OPAL_SUCCESS != ret ||
|
if (OPAL_SUCCESS != ret ||
|
||||||
opal_restart_globals.help ||
|
opal_restart_globals.help ) {
|
||||||
1 >= argc) {
|
|
||||||
char *args = NULL;
|
char *args = NULL;
|
||||||
args = opal_cmd_line_get_usage_msg(&cmd_line);
|
args = opal_cmd_line_get_usage_msg(&cmd_line);
|
||||||
opal_show_help("help-opal-restart.txt", "usage", true,
|
opal_show_help("help-opal-restart.txt", "usage", true,
|
||||||
@ -424,20 +533,11 @@ static int parse_args(int argc, char *argv[])
|
|||||||
|
|
||||||
/* get the remaining bits */
|
/* get the remaining bits */
|
||||||
opal_cmd_line_get_tail(&cmd_line, &argc, &argv);
|
opal_cmd_line_get_tail(&cmd_line, &argc, &argv);
|
||||||
if ( 1 > argc ) {
|
|
||||||
char *args = NULL;
|
|
||||||
args = opal_cmd_line_get_usage_msg(&cmd_line);
|
|
||||||
opal_show_help("help-opal-restart.txt", "usage", true,
|
|
||||||
args);
|
|
||||||
free(args);
|
|
||||||
return OPAL_ERROR;
|
|
||||||
}
|
|
||||||
|
|
||||||
opal_restart_globals.filename = strdup(argv[0]);
|
if ( NULL == opal_restart_globals.snapshot_ref ||
|
||||||
if ( NULL == opal_restart_globals.filename ||
|
0 >= strlen(opal_restart_globals.snapshot_ref) ) {
|
||||||
0 >= strlen(opal_restart_globals.filename) ) {
|
|
||||||
opal_show_help("help-opal-restart.txt", "invalid_filename", true,
|
opal_show_help("help-opal-restart.txt", "invalid_filename", true,
|
||||||
opal_restart_globals.filename);
|
opal_restart_globals.snapshot_ref);
|
||||||
return OPAL_ERROR;
|
return OPAL_ERROR;
|
||||||
}
|
}
|
||||||
|
|
||||||
@ -445,21 +545,20 @@ static int parse_args(int argc, char *argv[])
|
|||||||
* need to be grouped together.
|
* need to be grouped together.
|
||||||
* Useful in the 'mca crs self' instance.
|
* Useful in the 'mca crs self' instance.
|
||||||
*/
|
*/
|
||||||
if(argc > 1) {
|
if(argc > 0) {
|
||||||
opal_restart_globals.filename = strdup(opal_argv_join(argv, ' '));
|
opal_restart_globals.snapshot_ref = strdup(opal_argv_join(argv, ' '));
|
||||||
}
|
}
|
||||||
|
|
||||||
return OPAL_SUCCESS;
|
return OPAL_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
static int check_file(char *given_filename)
|
static int check_file(void)
|
||||||
{
|
{
|
||||||
int exit_status = OPAL_SUCCESS;
|
int exit_status = OPAL_SUCCESS;
|
||||||
int ret;
|
int ret;
|
||||||
char * path_to_check = NULL;
|
char * path_to_check = NULL;
|
||||||
char **argv = NULL;
|
|
||||||
|
|
||||||
if(NULL == given_filename) {
|
if(NULL == opal_restart_globals.snapshot_ref) {
|
||||||
opal_output(opal_restart_globals.output,
|
opal_output(opal_restart_globals.output,
|
||||||
"Error: No filename provided!");
|
"Error: No filename provided!");
|
||||||
exit_status = OPAL_ERROR;
|
exit_status = OPAL_ERROR;
|
||||||
@ -469,9 +568,10 @@ static int check_file(char *given_filename)
|
|||||||
/*
|
/*
|
||||||
* Check for the existence of the snapshot handle in the snapshot directory
|
* Check for the existence of the snapshot handle in the snapshot directory
|
||||||
*/
|
*/
|
||||||
path_to_check = opal_crs_base_get_snapshot_directory(given_filename);
|
asprintf(&path_to_check, "%s/%s",
|
||||||
|
opal_restart_globals.snapshot_loc,
|
||||||
|
opal_restart_globals.snapshot_ref);
|
||||||
|
|
||||||
/* Do the check */
|
|
||||||
opal_output_verbose(10, opal_restart_globals.output,
|
opal_output_verbose(10, opal_restart_globals.output,
|
||||||
"Checking for the existence of (%s)",
|
"Checking for the existence of (%s)",
|
||||||
path_to_check);
|
path_to_check);
|
||||||
@ -485,15 +585,15 @@ static int check_file(char *given_filename)
|
|||||||
}
|
}
|
||||||
|
|
||||||
cleanup:
|
cleanup:
|
||||||
if( NULL != path_to_check)
|
if( NULL != path_to_check) {
|
||||||
free(path_to_check);
|
free(path_to_check);
|
||||||
if( NULL != argv)
|
path_to_check = NULL;
|
||||||
opal_argv_free(argv);
|
}
|
||||||
|
|
||||||
return exit_status;
|
return exit_status;
|
||||||
}
|
}
|
||||||
|
|
||||||
static int post_env_vars(int prev_pid, char *location)
|
static int post_env_vars(int prev_pid, opal_crs_base_snapshot_t *snapshot)
|
||||||
{
|
{
|
||||||
int ret, exit_status = OPAL_SUCCESS;
|
int ret, exit_status = OPAL_SUCCESS;
|
||||||
char *command = NULL;
|
char *command = NULL;
|
||||||
@ -511,11 +611,10 @@ static int post_env_vars(int prev_pid, char *location)
|
|||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* JJH: Hardcode /tmp to match opal/runtime/opal_cr.c in the application.
|
|
||||||
* This is needed so we can pass the previous environment to the restarted
|
* This is needed so we can pass the previous environment to the restarted
|
||||||
* application process.
|
* application process.
|
||||||
*/
|
*/
|
||||||
asprintf(&proc_file, "/tmp/%s-%d", OPAL_CR_BASE_ENV_NAME, prev_pid);
|
asprintf(&proc_file, "%s/%s-%d", opal_tmp_directory(), OPAL_CR_BASE_ENV_NAME, prev_pid);
|
||||||
asprintf(&command, "env | grep OMPI_ > %s", proc_file);
|
asprintf(&command, "env | grep OMPI_ > %s", proc_file);
|
||||||
|
|
||||||
opal_output_verbose(5, opal_restart_globals.output,
|
opal_output_verbose(5, opal_restart_globals.output,
|
||||||
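This hunk replaces the hardcoded `/tmp` with `opal_tmp_directory()` when naming the file that carries the previous environment. The sketch below illustrates the idea under the assumption that the temporary directory comes from `TMPDIR` with `/tmp` as the default; the real function may consult other variables as well. `sketch_env_file` and the `restart-env` base name used in the test are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed lookup order: TMPDIR if set and non-empty, else "/tmp". */
static const char *sketch_tmp_directory(void)
{
    const char *dir = getenv("TMPDIR");
    return (NULL != dir && strlen(dir) > 0) ? dir : "/tmp";
}

/*
 * Build "<tmpdir>/<base_name>-<prev_pid>".
 * Returns a malloc'd string that the caller must free, or NULL on failure.
 */
char *sketch_env_file(const char *base_name, int prev_pid)
{
    const char *dir = sketch_tmp_directory();
    int len;
    char *path;

    assert(NULL != base_name);

    len = snprintf(NULL, 0, "%s/%s-%d", dir, base_name, prev_pid);
    if (len < 0) {
        return NULL;
    }
    path = malloc((size_t)len + 1);
    if (NULL == path) {
        return NULL;
    }
    snprintf(path, (size_t)len + 1, "%s/%s-%d", dir, base_name, prev_pid);
    return path;
}
```

Both the checkpointed application and the restart tool have to resolve the same directory for this to work, which is presumably why the diff routes both through one helper instead of a literal path.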
@ -530,7 +629,14 @@ static int post_env_vars(int prev_pid, char *location)
|
|||||||
/*
|
/*
|
||||||
* Any directories that need to be created
|
* Any directories that need to be created
|
||||||
*/
|
*/
|
||||||
opal_crs_base_metadata_read_token(location, CRS_METADATA_MKDIR, &loc_mkdir);
|
if( NULL == (snapshot->metadata = fopen(snapshot->metadata_filename, "r")) ) {
|
||||||
|
opal_show_help("help-opal-restart.txt", "invalid_metadata", true,
|
||||||
|
opal_restart_globals.snapshot_metadata,
|
||||||
|
snapshot->metadata_filename);
|
||||||
|
exit_status = OPAL_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
opal_crs_base_metadata_read_token(snapshot->metadata, CRS_METADATA_MKDIR, &loc_mkdir);
|
||||||
argc = opal_argv_count(loc_mkdir);
|
argc = opal_argv_count(loc_mkdir);
|
||||||
for( i = 0; i < argc; ++i ) {
|
for( i = 0; i < argc; ++i ) {
|
||||||
if( NULL != command ) {
|
if( NULL != command ) {
|
||||||
@ -555,7 +661,7 @@ static int post_env_vars(int prev_pid, char *location)
|
|||||||
/*
|
/*
|
||||||
* Any files that need to exist
|
* Any files that need to exist
|
||||||
*/
|
*/
|
||||||
opal_crs_base_metadata_read_token(location, CRS_METADATA_TOUCH, &loc_touch);
|
opal_crs_base_metadata_read_token(snapshot->metadata, CRS_METADATA_TOUCH, &loc_touch);
|
||||||
argc = opal_argv_count(loc_touch);
|
argc = opal_argv_count(loc_touch);
|
||||||
for( i = 0; i < argc; ++i ) {
|
for( i = 0; i < argc; ++i ) {
|
||||||
if( NULL != command ) {
|
if( NULL != command ) {
|
||||||
@ -595,5 +701,10 @@ static int post_env_vars(int prev_pid, char *location)
|
|||||||
loc_touch = NULL;
|
loc_touch = NULL;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if( NULL != snapshot->metadata ) {
|
||||||
|
fclose(snapshot->metadata);
|
||||||
|
snapshot->metadata = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
return exit_status;
|
return exit_status;
|
||||||
}
|
}
|
||||||
|
@ -1,6 +1,9 @@
|
|||||||
# -*- shell-script -*-
|
# -*- shell-script -*-
|
||||||
#
|
#
|
||||||
# Copyright (c) 2009-2010 Cisco Systems, Inc. All rights reserved.
|
# Copyright (c) 2009-2010 Cisco Systems, Inc. All rights reserved.
|
||||||
|
# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
|
||||||
|
# University Research and Technology
|
||||||
|
# Corporation. All rights reserved.
|
||||||
# $COPYRIGHT$
|
# $COPYRIGHT$
|
||||||
#
|
#
|
||||||
# Additional copyrights may follow
|
# Additional copyrights may follow
|
||||||
@ -27,6 +30,7 @@ AC_DEFUN([ORTE_CONFIG_FILES],[
|
|||||||
orte/tools/orte-clean/Makefile
|
orte/tools/orte-clean/Makefile
|
||||||
orte/tools/orte-top/Makefile
|
orte/tools/orte-top/Makefile
|
||||||
orte/tools/orte-bootproxy/Makefile
|
orte/tools/orte-bootproxy/Makefile
|
||||||
|
orte/tools/orte-migrate/Makefile
|
||||||
orte/tools/orte-info/Makefile
|
orte/tools/orte-info/Makefile
|
||||||
])
|
])
|
||||||
])
|
])
|
||||||
|
38
orte/mca/errmgr/autor/Makefile.am
Normal file
@ -0,0 +1,38 @@
|
|||||||
|
#
|
||||||
|
# Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||||
|
# All rights reserved.
|
||||||
|
#
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
|
||||||
|
dist_pkgdata_DATA = help-orte-errmgr-autor.txt
|
||||||
|
|
||||||
|
sources = \
|
||||||
|
errmgr_autor.h \
|
||||||
|
errmgr_autor_component.c \
|
||||||
|
errmgr_autor_module.c
|
||||||
|
|
||||||
|
# Make the output library in this directory, and name it either
|
||||||
|
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
|
||||||
|
# (for static builds).
|
||||||
|
|
||||||
|
if OMPI_BUILD_errmgr_autor_DSO
|
||||||
|
component_noinst =
|
||||||
|
component_install = mca_errmgr_autor.la
|
||||||
|
else
|
||||||
|
component_noinst = libmca_errmgr_autor.la
|
||||||
|
component_install =
|
||||||
|
endif
|
||||||
|
|
||||||
|
mcacomponentdir = $(pkglibdir)
|
||||||
|
mcacomponent_LTLIBRARIES = $(component_install)
|
||||||
|
mca_errmgr_autor_la_SOURCES = $(sources)
|
||||||
|
mca_errmgr_autor_la_LDFLAGS = -module -avoid-version
|
||||||
|
|
||||||
|
noinst_LTLIBRARIES = $(component_noinst)
|
||||||
|
libmca_errmgr_autor_la_SOURCES = $(sources)
|
||||||
|
libmca_errmgr_autor_la_LDFLAGS = -module -avoid-version
|
20
orte/mca/errmgr/autor/configure.m4
Normal file
@ -0,0 +1,20 @@
|
|||||||
|
# -*- shell-script -*-
|
||||||
|
#
|
||||||
|
# Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||||
|
# All rights reserved.
|
||||||
|
#
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
|
||||||
|
# MCA_errmgr_autor_CONFIG([action-if-found], [action-if-not-found])
|
||||||
|
# -----------------------------------------------------------
|
||||||
|
AC_DEFUN([MCA_errmgr_autor_CONFIG],[
|
||||||
|
# If we don't want FT, don't compile this component
|
||||||
|
AS_IF([test "$opal_want_ft_cr" = "1"],
|
||||||
|
[$1],
|
||||||
|
[$2])
|
||||||
|
])dnl
|
14
orte/mca/errmgr/autor/configure.params
Normal file
@ -0,0 +1,14 @@
|
|||||||
|
# -*- shell-script -*-
|
||||||
|
#
|
||||||
|
# Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||||
|
# All rights reserved.
|
||||||
|
#
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
|
||||||
|
PARAM_INIT_FILE=errmgr_autor_component.c
|
||||||
|
PARAM_CONFIG_FILES="Makefile"
|
88
orte/mca/errmgr/autor/errmgr_autor.h
Normal file
@ -0,0 +1,88 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||||
|
* All rights reserved.
|
||||||
|
*
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file
|
||||||
|
*
|
||||||
|
* Automatic Recovery Errmgr component
|
||||||
|
*
|
||||||
|
*/
|
||||||
|
|
||||||
|
#ifndef MCA_ERRMGR_AUTOR_EXPORT_H
|
||||||
|
#define MCA_ERRMGR_AUTOR_EXPORT_H
|
||||||
|
|
||||||
|
#include "orte_config.h"
|
||||||
|
|
||||||
|
#include "opal/mca/mca.h"
|
||||||
|
#include "opal/event/event.h"
|
||||||
|
|
||||||
|
#include "orte/mca/filem/filem.h"
|
||||||
|
#include "orte/mca/errmgr/errmgr.h"
|
||||||
|
|
||||||
|
BEGIN_C_DECLS
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Local Component structures
|
||||||
|
*/
|
||||||
|
struct orte_errmgr_autor_component_t {
|
||||||
|
orte_errmgr_base_component_t super; /** Base Errmgr component */
|
||||||
|
bool autor_enabled;
|
||||||
|
bool timing_enabled;
|
||||||
|
int recovery_delay;
|
||||||
|
bool skip_oldnode;
|
||||||
|
};
|
||||||
|
typedef struct orte_errmgr_autor_component_t orte_errmgr_autor_component_t;
|
||||||
|
OPAL_MODULE_DECLSPEC extern orte_errmgr_autor_component_t mca_errmgr_autor_component;
|
||||||
|
|
||||||
|
int orte_errmgr_autor_component_query(mca_base_module_t **module, int *priority);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Module functions: Global
|
||||||
|
*/
|
||||||
|
int orte_errmgr_autor_global_module_init(void);
|
||||||
|
int orte_errmgr_autor_global_module_finalize(void);
|
||||||
|
|
||||||
|
int orte_errmgr_autor_global_update_state(orte_jobid_t job,
|
||||||
|
orte_job_state_t jobstate,
|
||||||
|
orte_process_name_t *proc_name,
|
||||||
|
orte_proc_state_t state,
|
||||||
|
pid_t pid,
|
||||||
|
orte_exit_code_t exit_code,
|
||||||
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
|
int orte_errmgr_autor_global_process_fault(orte_job_t *jdata,
|
||||||
|
orte_process_name_t *proc_name,
|
||||||
|
orte_proc_state_t state,
|
||||||
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
|
int orte_errmgr_autor_global_suggest_map_targets(orte_proc_t *proc,
|
||||||
|
orte_node_t *oldnode,
|
||||||
|
opal_list_t *node_list,
|
||||||
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
|
int orte_errmgr_autor_global_ft_event(int state);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Module functions: Local (Daemon)
|
||||||
|
*/
|
||||||
|
int orte_errmgr_autor_local_module_init(void);
|
||||||
|
int orte_errmgr_autor_local_module_finalize(void);
|
||||||
|
|
||||||
|
int orte_errmgr_autor_local_update_state(orte_jobid_t job,
|
||||||
|
orte_job_state_t jobstate,
|
||||||
|
orte_process_name_t *proc_name,
|
||||||
|
orte_proc_state_t state,
|
||||||
|
pid_t pid,
|
||||||
|
orte_exit_code_t exit_code,
|
||||||
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
|
int orte_errmgr_autor_local_ft_event(int state);
|
||||||
|
|
||||||
|
|
||||||
|
END_C_DECLS
|
||||||
|
|
||||||
|
#endif /* MCA_ERRMGR_AUTOR_EXPORT_H */
|
161
orte/mca/errmgr/autor/errmgr_autor_component.c
Normal file
@ -0,0 +1,161 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||||
|
* All rights reserved.
|
||||||
|
*
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include "orte_config.h"
|
||||||
|
#include "opal/util/output.h"
|
||||||
|
|
||||||
|
#include "orte/mca/errmgr/errmgr.h"
|
||||||
|
#include "orte/mca/errmgr/base/base.h"
|
||||||
|
#include "orte/mca/errmgr/base/errmgr_private.h"
|
||||||
|
#include "errmgr_autor.h"
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Public string for version number
|
||||||
|
*/
|
||||||
|
const char *orte_errmgr_autor_component_version_string =
|
||||||
|
"ORTE ERRMGR AutoR MCA component version " ORTE_VERSION;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Local functionality
|
||||||
|
*/
|
||||||
|
static int errmgr_autor_open(void);
|
||||||
|
static int errmgr_autor_close(void);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Instantiate the public struct with all of our public information
|
||||||
|
* and pointer to our public functions in it
|
||||||
|
*/
|
||||||
|
orte_errmgr_autor_component_t mca_errmgr_autor_component = {
|
||||||
|
/* First do the base component stuff */
|
||||||
|
{
|
||||||
|
/* Handle the general mca_component_t struct containing
|
||||||
|
* meta information about the component itself
|
||||||
|
*/
|
||||||
|
{
|
||||||
|
ORTE_ERRMGR_BASE_VERSION_3_0_0,
|
||||||
|
/* Component name and version */
|
||||||
|
"autor",
|
||||||
|
ORTE_MAJOR_VERSION,
|
||||||
|
ORTE_MINOR_VERSION,
|
||||||
|
ORTE_RELEASE_VERSION,
|
||||||
|
|
||||||
|
/* Component open and close functions */
|
||||||
|
errmgr_autor_open,
|
||||||
|
errmgr_autor_close,
|
||||||
|
orte_errmgr_autor_component_query
|
||||||
|
},
|
||||||
|
{
|
||||||
|
/* The component is checkpoint ready */
|
||||||
|
MCA_BASE_METADATA_PARAM_CHECKPOINT
|
||||||
|
},
|
||||||
|
|
||||||
|
/* Verbosity level */
|
||||||
|
0,
|
||||||
|
/* opal_output handler */
|
||||||
|
-1,
|
||||||
|
/* Default priority */
|
||||||
|
20
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
static int errmgr_autor_open(void)
|
||||||
|
{
|
||||||
|
int val;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* This should be the last component to ever get used since
|
||||||
|
* it doesn't do anything.
|
||||||
|
*/
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||||
|
"priority",
|
||||||
|
"Priority of the ERRMGR autor component",
|
||||||
|
false, false,
|
||||||
|
mca_errmgr_autor_component.super.priority,
|
||||||
|
&mca_errmgr_autor_component.super.priority);
|
||||||
|
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||||
|
"verbose",
|
||||||
|
"Verbose level for the ERRMGR autor component",
|
||||||
|
false, false,
|
||||||
|
mca_errmgr_autor_component.super.verbose,
|
||||||
|
&mca_errmgr_autor_component.super.verbose);
|
||||||
|
/* If there is a custom verbose level for this component then use it,
|
||||||
|
* otherwise take our parents level and output channel
|
||||||
|
*/
|
||||||
|
if ( 0 != mca_errmgr_autor_component.super.verbose) {
|
||||||
|
mca_errmgr_autor_component.super.output_handle = opal_output_open(NULL);
|
||||||
|
opal_output_set_verbosity(mca_errmgr_autor_component.super.output_handle,
|
||||||
|
mca_errmgr_autor_component.super.verbose);
|
||||||
|
} else {
|
||||||
|
mca_errmgr_autor_component.super.output_handle = orte_errmgr_base.output;
|
||||||
|
}
|
||||||
|
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||||
|
"timing",
|
||||||
|
"Enable Automatic Recovery timer",
|
||||||
|
false, false,
|
||||||
|
0, &val);
|
||||||
|
mca_errmgr_autor_component.timing_enabled = OPAL_INT_TO_BOOL(val);
|
||||||
|
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||||
|
"enable",
|
||||||
|
"Enable Automatic Recovery (Default: 0/off)",
|
||||||
|
false, false,
|
||||||
|
0, &val);
|
||||||
|
mca_errmgr_autor_component.autor_enabled = OPAL_INT_TO_BOOL(val);
|
||||||
|
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||||
|
"recovery_delay",
|
||||||
|
"Number of seconds to wait before starting to recover the job after a failure"
|
||||||
|
" [Default: 1 sec]",
|
||||||
|
false, false,
|
||||||
|
1, &val);
|
||||||
|
mca_errmgr_autor_component.recovery_delay = val;
|
||||||
|
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||||
|
"skip_oldnode",
|
||||||
|
"Skip the old node from failed proc, even if it is still available"
|
||||||
|
" [Default: Enabled]",
|
||||||
|
false, false,
|
||||||
|
1, &val);
|
||||||
|
mca_errmgr_autor_component.skip_oldnode = OPAL_INT_TO_BOOL(val);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Debug Output
|
||||||
|
*/
|
||||||
|
opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
|
||||||
|
"errmgr:autor: open()");
|
||||||
|
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||||
|
"errmgr:autor: open: priority = %d",
|
||||||
|
mca_errmgr_autor_component.super.priority);
|
||||||
|
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||||
|
"errmgr:autor: open: verbosity = %d",
|
||||||
|
mca_errmgr_autor_component.super.verbose);
|
||||||
|
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||||
|
"errmgr:autor: open: timing = %s",
|
||||||
|
(mca_errmgr_autor_component.timing_enabled ? "Enabled" : "Disabled"));
|
||||||
|
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||||
|
"errmgr:autor: open: Auto. Recover = %s",
|
||||||
|
(mca_errmgr_autor_component.autor_enabled ? "Enabled" : "Disabled"));
|
||||||
|
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||||
|
"errmgr:autor: open: recovery_delay = %d",
|
||||||
|
mca_errmgr_autor_component.recovery_delay);
|
||||||
|
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
static int errmgr_autor_close(void)
|
||||||
|
{
|
||||||
|
opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
|
||||||
|
"errmgr:autor: close()");
|
||||||
|
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
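The open function above registers integer parameters with defaults (enable = 0, recovery_delay = 1, skip_oldnode = 1) and converts some of them to booleans. The following is a minimal standalone sketch of that "integer with default, then to bool" pattern. It does not use the MCA parameter API; `demo_autor_config_t`, `demo_param_int` and `demo_load_config` are hypothetical names introduced only for illustration, and the raw values are passed in as plain strings.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified view of the component's tunables. */
typedef struct {
    bool autor_enabled;   /* default: off */
    int  recovery_delay;  /* seconds, default: 1 */
    bool skip_oldnode;    /* default: on */
} demo_autor_config_t;

/* Parse raw as a base-10 int. Fall back to dflt when raw is NULL,
 * empty, malformed, or out of range for int. */
static int demo_param_int(const char *raw, int dflt)
{
    char *end = NULL;
    long v;

    if (NULL == raw || '\0' == raw[0]) {
        return dflt;
    }
    errno = 0;
    v = strtol(raw, &end, 10);
    if (0 != errno || '\0' != *end || v < INT_MIN || v > INT_MAX) {
        return dflt;
    }
    return (int)v;
}

/* Fill cfg from raw strings, using the defaults shown in the diff. */
static void demo_load_config(demo_autor_config_t *cfg,
                             const char *enable_raw,
                             const char *delay_raw,
                             const char *skip_raw)
{
    assert(NULL != cfg);
    cfg->autor_enabled  = (0 != demo_param_int(enable_raw, 0));
    cfg->recovery_delay = demo_param_int(delay_raw, 1);
    cfg->skip_oldnode   = (0 != demo_param_int(skip_raw, 1));
}
```

A caller could feed the raw strings from wherever its configuration lives, for example environment variables. Which variable names apply is an assumption left to the caller and is not taken from this diff.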
|
1194
orte/mca/errmgr/autor/errmgr_autor_module.c
Normal file
File diff suppressed because it is too large
28
orte/mca/errmgr/autor/help-orte-errmgr-autor.txt
Normal file
@ -0,0 +1,28 @@
|
|||||||
|
-*- text -*-
|
||||||
|
#
|
||||||
|
# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
|
||||||
|
# University Research and Technology
|
||||||
|
# Corporation. All rights reserved.
|
||||||
|
#
|
||||||
|
# $COPYRIGHT$
|
||||||
|
#
|
||||||
|
# Additional copyrights may follow
|
||||||
|
#
|
||||||
|
# $HEADER$
|
||||||
|
#
|
||||||
|
# This is the US/English general help file for ORTE ErrMgr AutoR framework.
|
||||||
|
#
|
||||||
|
[recovering_job]
|
||||||
|
Notice: The processes listed below failed unexpectedly.
|
||||||
|
Using the last checkpoint to recover the job.
|
||||||
|
Please stand by.
|
||||||
|
%s
|
||||||
|
[recovery_complete]
|
||||||
|
Notice: The job has been successfully recovered from the
|
||||||
|
last checkpoint.
|
||||||
|
[failed_to_recover_proc]
|
||||||
|
Error: The process below has failed. There is no checkpoint available for
|
||||||
|
this job, so we are terminating the application since automatic
|
||||||
|
recovery cannot occur.
|
||||||
|
Internal Name: %s
|
||||||
|
MCW Rank: %d
|
@ -1,5 +1,5 @@
|
|||||||
#
|
#
|
||||||
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
|
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||||
# University Research and Technology
|
# University Research and Technology
|
||||||
# Corporation. All rights reserved.
|
# Corporation. All rights reserved.
|
||||||
# Copyright (c) 2004-2005 The University of Tennessee and The University
|
# Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||||
@ -24,4 +24,5 @@ libmca_errmgr_la_SOURCES += \
|
|||||||
base/errmgr_base_close.c \
|
base/errmgr_base_close.c \
|
||||||
base/errmgr_base_select.c \
|
base/errmgr_base_select.c \
|
||||||
base/errmgr_base_open.c \
|
base/errmgr_base_open.c \
|
||||||
base/errmgr_base_fns.c
|
base/errmgr_base_fns.c \
|
||||||
|
base/errmgr_base_tool.c
|
||||||
|
@ -30,6 +30,7 @@
|
|||||||
#include "opal/class/opal_list.h"
|
#include "opal/class/opal_list.h"
|
||||||
|
|
||||||
#include "opal/mca/mca.h"
|
#include "opal/mca/mca.h"
|
||||||
|
#include "orte/mca/snapc/base/base.h"
|
||||||
#include "orte/mca/errmgr/errmgr.h"
|
#include "orte/mca/errmgr/errmgr.h"
|
||||||
|
|
||||||
|
|
||||||
@ -56,6 +57,51 @@ ORTE_DECLSPEC int orte_errmgr_base_close(void);
|
|||||||
*/
|
*/
|
||||||
ORTE_DECLSPEC extern opal_list_t orte_errmgr_base_components_available;
|
ORTE_DECLSPEC extern opal_list_t orte_errmgr_base_components_available;
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Interfaces for orte-migrate tool
|
||||||
|
*/
|
||||||
|
#if OPAL_ENABLE_FT_CR
|
||||||
|
/**
|
||||||
|
* Migrating States
|
||||||
|
*/
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_STATE_ERROR (ORTE_SNAPC_CKPT_MAX + 1)
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS (ORTE_SNAPC_CKPT_MAX + 2)
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_STATE_NONE (ORTE_SNAPC_CKPT_MAX + 3)
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_STATE_REQUEST (ORTE_SNAPC_CKPT_MAX + 4)
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_STATE_RUNNING (ORTE_SNAPC_CKPT_MAX + 5)
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_STATE_RUN_CKPT (ORTE_SNAPC_CKPT_MAX + 6)
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_STATE_STARTUP (ORTE_SNAPC_CKPT_MAX + 7)
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_STATE_FINISH (ORTE_SNAPC_CKPT_MAX + 8)
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_MAX (ORTE_SNAPC_CKPT_MAX + 9)
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Commands for command line tool and ErrMgr interaction
|
||||||
|
*/
|
||||||
|
typedef uint8_t orte_errmgr_tool_cmd_flag_t;
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_TOOL_CMD OPAL_UINT8
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_TOOL_INIT_CMD 1
|
||||||
|
#define ORTE_ERRMGR_MIGRATE_TOOL_UPDATE_CMD 2
|
||||||
|
|
||||||
|
/* Initialize/Finalize the orte-migrate communication functionality */
|
||||||
|
ORTE_DECLSPEC int orte_errmgr_base_tool_init(void);
|
||||||
|
ORTE_DECLSPEC int orte_errmgr_base_tool_finalize(void);
|
||||||
|
|
||||||
|
ORTE_DECLSPEC int orte_errmgr_base_migrate_state_str(char ** state_str, int state);
|
||||||
|
|
||||||
|
ORTE_DECLSPEC int orte_errmgr_base_migrate_update(int status);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Interfaces for C/R related recovery
|
||||||
|
*/
|
||||||
|
ORTE_DECLSPEC int orte_errmgr_base_update_app_context_for_cr_recovery(orte_job_t *jobdata,
|
||||||
|
orte_proc_t *proc,
|
||||||
|
opal_list_t *local_snapshots);
|
||||||
|
|
||||||
|
ORTE_DECLSPEC int orte_errmgr_base_restart_job(orte_jobid_t jobid, char * global_handle, int seq_num);
|
||||||
|
ORTE_DECLSPEC int orte_errmgr_base_migrate_job(orte_jobid_t jobid, orte_snapc_base_request_op_t *datum);
|
||||||
|
|
||||||
|
#endif
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Additional External API function declared in errmgr.h
|
* Additional External API function declared in errmgr.h
|
||||||
*/
|
*/
|
||||||
|
@ -21,27 +21,157 @@
|
|||||||
#include "orte_config.h"
|
#include "orte_config.h"
|
||||||
#include "orte/constants.h"
|
#include "orte/constants.h"
|
||||||
|
|
||||||
|
#ifdef HAVE_STRING_H
|
||||||
|
#include <string.h>
|
||||||
|
#endif
|
||||||
|
#if HAVE_SYS_TYPES_H
|
||||||
|
#include <sys/types.h>
|
||||||
|
#endif /* HAVE_SYS_TYPES_H */
|
||||||
#ifdef HAVE_UNISTD_H
|
#ifdef HAVE_UNISTD_H
|
||||||
#include <unistd.h>
|
#include <unistd.h>
|
||||||
#endif
|
#endif /* HAVE_UNISTD_H */
|
||||||
|
#if HAVE_SYS_TYPES_H
|
||||||
|
#include <sys/types.h>
|
||||||
|
#endif /* HAVE_SYS_TYPES_H */
|
||||||
|
#if HAVE_SYS_STAT_H
|
||||||
|
#include <sys/stat.h>
|
||||||
|
#endif /* HAVE_SYS_STAT_H */
|
||||||
|
#ifdef HAVE_DIRENT_H
|
||||||
|
#include <dirent.h>
|
||||||
|
#endif /* HAVE_DIRENT_H */
|
||||||
|
#include <time.h>
|
||||||
|
|
||||||
#include <stdlib.h>
|
#include <stdlib.h>
|
||||||
#include <stdarg.h>
|
#include <stdarg.h>
|
||||||
|
|
||||||
|
#include "opal/mca/mca.h"
|
||||||
|
#include "opal/mca/base/base.h"
|
||||||
|
#include "opal/mca/base/mca_base_param.h"
|
||||||
#include "opal/util/trace.h"
|
#include "opal/util/trace.h"
|
||||||
|
#include "opal/util/os_dirpath.h"
|
||||||
#include "opal/util/output.h"
|
#include "opal/util/output.h"
|
||||||
|
#include "opal/util/basename.h"
|
||||||
|
#include "opal/util/argv.h"
|
||||||
|
#include "opal/mca/crs/crs.h"
|
||||||
|
#include "opal/mca/crs/base/base.h"
|
||||||
#include "opal/util/opal_sos.h"
|
#include "opal/util/opal_sos.h"
|
||||||
|
|
||||||
#include "orte/util/name_fns.h"
|
#include "orte/util/name_fns.h"
|
||||||
#include "orte/util/session_dir.h"
|
#include "orte/util/session_dir.h"
|
||||||
|
|
||||||
|
#include "orte/runtime/orte_globals.h"
|
||||||
|
#include "orte/runtime/runtime.h"
|
||||||
|
#include "orte/runtime/orte_wait.h"
|
||||||
|
#include "orte/runtime/orte_locks.h"
|
||||||
|
|
||||||
#include "orte/mca/ess/ess.h"
|
#include "orte/mca/ess/ess.h"
|
||||||
#include "orte/mca/odls/odls.h"
|
#include "orte/mca/odls/odls.h"
|
||||||
|
#include "orte/mca/plm/plm.h"
|
||||||
|
#include "orte/mca/rml/rml.h"
|
||||||
|
#include "orte/mca/rml/rml_types.h"
|
||||||
#include "orte/mca/routed/routed.h"
|
#include "orte/mca/routed/routed.h"
|
||||||
#include "orte/runtime/orte_globals.h"
|
#include "orte/mca/snapc/snapc.h"
|
||||||
|
#include "orte/mca/snapc/base/base.h"
|
||||||
|
#include "orte/mca/sstore/sstore.h"
|
||||||
|
#include "orte/mca/sstore/base/base.h"
|
||||||
|
|
||||||
#include "orte/mca/errmgr/errmgr.h"
|
#include "orte/mca/errmgr/errmgr.h"
|
||||||
#include "orte/mca/errmgr/base/base.h"
|
#include "orte/mca/errmgr/base/base.h"
|
||||||
#include "orte/mca/errmgr/base/errmgr_private.h"
|
#include "orte/mca/errmgr/base/errmgr_private.h"
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Object stuff
|
||||||
|
*/
|
||||||
|
void orte_errmgr_predicted_proc_construct(orte_errmgr_predicted_proc_t *item);
|
||||||
|
void orte_errmgr_predicted_proc_destruct( orte_errmgr_predicted_proc_t *item);
|
||||||
|
|
||||||
|
OBJ_CLASS_INSTANCE(orte_errmgr_predicted_proc_t,
|
||||||
|
opal_list_item_t,
|
||||||
|
orte_errmgr_predicted_proc_construct,
|
||||||
|
orte_errmgr_predicted_proc_destruct);
|
||||||
|
|
||||||
|
void orte_errmgr_predicted_proc_construct(orte_errmgr_predicted_proc_t *item)
|
||||||
|
{
|
||||||
|
item->proc_name.vpid = ORTE_VPID_INVALID;
|
||||||
|
item->proc_name.jobid = ORTE_JOBID_INVALID;
|
||||||
|
}
|
||||||
|
|
||||||
|
void orte_errmgr_predicted_proc_destruct( orte_errmgr_predicted_proc_t *item)
|
||||||
|
{
|
||||||
|
item->proc_name.vpid = ORTE_VPID_INVALID;
|
||||||
|
item->proc_name.jobid = ORTE_JOBID_INVALID;
|
||||||
|
}
|
||||||
|
|
||||||
|
void orte_errmgr_predicted_node_construct(orte_errmgr_predicted_node_t *item);
|
||||||
|
void orte_errmgr_predicted_node_destruct( orte_errmgr_predicted_node_t *item);
|
||||||
|
|
||||||
|
OBJ_CLASS_INSTANCE(orte_errmgr_predicted_node_t,
|
||||||
|
opal_list_item_t,
|
||||||
|
orte_errmgr_predicted_node_construct,
|
||||||
|
orte_errmgr_predicted_node_destruct);
|
||||||
|
|
||||||
|
void orte_errmgr_predicted_node_construct(orte_errmgr_predicted_node_t *item)
|
||||||
|
{
|
||||||
|
item->node_name = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
void orte_errmgr_predicted_node_destruct( orte_errmgr_predicted_node_t *item)
|
||||||
|
{
|
||||||
|
if( NULL != item->node_name ) {
|
||||||
|
free(item->node_name);
|
||||||
|
item->node_name = NULL;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
void orte_errmgr_predicted_map_construct(orte_errmgr_predicted_map_t *item);
|
||||||
|
void orte_errmgr_predicted_map_destruct( orte_errmgr_predicted_map_t *item);
|
||||||
|
|
||||||
|
OBJ_CLASS_INSTANCE(orte_errmgr_predicted_map_t,
|
||||||
|
opal_list_item_t,
|
||||||
|
orte_errmgr_predicted_map_construct,
|
||||||
|
orte_errmgr_predicted_map_destruct);
|
||||||
|
|
||||||
|
void orte_errmgr_predicted_map_construct(orte_errmgr_predicted_map_t *item)
|
||||||
|
{
|
||||||
|
item->proc_name.vpid = ORTE_VPID_INVALID;
|
||||||
|
item->proc_name.jobid = ORTE_JOBID_INVALID;
|
||||||
|
|
||||||
|
item->node_name = NULL;
|
||||||
|
|
||||||
|
item->map_proc_name.vpid = ORTE_VPID_INVALID;
|
||||||
|
item->map_proc_name.jobid = ORTE_JOBID_INVALID;
|
||||||
|
|
||||||
|
item->map_node_name = NULL;
|
||||||
|
item->off_current_node = false;
|
||||||
|
item->pre_map_fixed_node = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
void orte_errmgr_predicted_map_destruct( orte_errmgr_predicted_map_t *item)
|
||||||
|
{
|
||||||
|
item->proc_name.vpid = ORTE_VPID_INVALID;
|
||||||
|
item->proc_name.jobid = ORTE_JOBID_INVALID;
|
||||||
|
|
||||||
|
if( NULL != item->node_name ) {
|
||||||
|
free(item->node_name);
|
||||||
|
item->node_name = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
item->map_proc_name.vpid = ORTE_VPID_INVALID;
|
||||||
|
item->map_proc_name.jobid = ORTE_JOBID_INVALID;
|
||||||
|
|
||||||
|
if( NULL != item->map_node_name ) {
|
||||||
|
free(item->map_node_name);
|
||||||
|
item->map_node_name = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
item->off_current_node = false;
|
||||||
|
|
||||||
|
if( NULL != item->pre_map_fixed_node ) {
|
||||||
|
free(item->pre_map_fixed_node);
|
||||||
|
item->pre_map_fixed_node = NULL;
|
||||||
|
}
|
||||||
|
}
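The constructor/destructor pairs above all follow one rule: the constructor sets every owned pointer to NULL, and the destructor frees whatever is set and resets it, so destroying an item twice is harmless. Below is a minimal standalone sketch of that rule without the OPAL object system. `demo_map_t` and its helpers are hypothetical simplifications, not the real `orte_errmgr_predicted_map_t`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a predicted-map list item. */
typedef struct {
    char *node_name;        /* owned, may be NULL */
    char *map_node_name;    /* owned, may be NULL */
    bool  off_current_node;
} demo_map_t;

/* Put the item into a known empty state. */
static void demo_map_construct(demo_map_t *item)
{
    assert(NULL != item);
    item->node_name = NULL;
    item->map_node_name = NULL;
    item->off_current_node = false;
}

/* Release owned strings and return to the empty state. */
static void demo_map_destruct(demo_map_t *item)
{
    assert(NULL != item);
    free(item->node_name);      /* free(NULL) is a no-op */
    item->node_name = NULL;
    free(item->map_node_name);
    item->map_node_name = NULL;
    item->off_current_node = false;
}

/* Store a private copy of name as the target node.
 * Returns 0 on success, -1 on allocation failure. */
static int demo_map_set_target(demo_map_t *item, const char *name)
{
    size_t n;
    char *copy;

    assert(NULL != item && NULL != name);
    n = strlen(name) + 1;
    copy = malloc(n);
    if (NULL == copy) {
        return -1;
    }
    memcpy(copy, name, n);
    free(item->map_node_name);  /* drop any previous value */
    item->map_node_name = copy;
    return 0;
}
```

Because the item always owns private copies, the code that fills it can free or reuse its own buffers without affecting the item.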
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Public interfaces
|
* Public interfaces
|
||||||
*/
|
*/
|
||||||
@ -135,9 +265,9 @@ int orte_errmgr_base_abort(int error_code, char *fmt, ...)
|
|||||||
return ORTE_SUCCESS;
|
return ORTE_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
int orte_errmgr_base_predicted_fault(char ***proc_list,
|
int orte_errmgr_base_predicted_fault(opal_list_t *proc_list,
|
||||||
char ***node_list,
|
opal_list_t *node_list,
|
||||||
char ***suggested_nodes)
|
opal_list_t *suggested_map)
|
||||||
{
|
{
|
||||||
orte_errmgr_base_module_t *module = NULL;
|
orte_errmgr_base_module_t *module = NULL;
|
||||||
int i, rc;
|
int i, rc;
|
||||||
@ -155,7 +285,7 @@ int orte_errmgr_base_predicted_fault(char ***proc_list,
|
|||||||
continue;
|
continue;
|
||||||
}
|
}
|
||||||
if( NULL != module->predicted_fault ) {
|
if( NULL != module->predicted_fault ) {
|
||||||
rc = module->predicted_fault(proc_list, node_list, suggested_nodes, &stack_state);
|
rc = module->predicted_fault(proc_list, node_list, suggested_map, &stack_state);
|
||||||
if (ORTE_SUCCESS != rc || ORTE_ERRMGR_STACK_STATE_COMPLETE & stack_state) {
|
if (ORTE_SUCCESS != rc || ORTE_ERRMGR_STACK_STATE_COMPLETE & stack_state) {
|
||||||
break;
|
break;
|
||||||
}
|
}
|
||||||
@ -218,3 +348,348 @@ int orte_errmgr_base_ft_event(int state)
|
|||||||
|
|
||||||
return ORTE_SUCCESS;
|
return ORTE_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
/********************
|
||||||
|
* Utility functions
|
||||||
|
********************/
|
||||||
|
#if OPAL_ENABLE_FT_CR
|
||||||
|
int orte_errmgr_base_migrate_state_str(char ** state_str, int state)
|
||||||
|
{
|
||||||
|
switch(state) {
|
||||||
|
case ORTE_ERRMGR_MIGRATE_STATE_NONE:
|
||||||
|
*state_str = strdup(" -- ");
|
||||||
|
break;
|
||||||
|
case ORTE_ERRMGR_MIGRATE_STATE_REQUEST:
|
||||||
|
*state_str = strdup("Requested");
|
||||||
|
break;
|
||||||
|
case ORTE_ERRMGR_MIGRATE_STATE_RUNNING:
|
||||||
|
*state_str = strdup("Running");
|
||||||
|
break;
|
||||||
|
case ORTE_ERRMGR_MIGRATE_STATE_RUN_CKPT:
|
||||||
|
*state_str = strdup("Checkpointing");
|
||||||
|
break;
|
||||||
|
case ORTE_ERRMGR_MIGRATE_STATE_STARTUP:
|
||||||
|
*state_str = strdup("Restarting");
|
||||||
|
break;
|
||||||
|
case ORTE_ERRMGR_MIGRATE_STATE_FINISH:
|
||||||
|
*state_str = strdup("Finished");
|
||||||
|
break;
|
||||||
|
case ORTE_ERRMGR_MIGRATE_STATE_ERROR:
|
||||||
|
*state_str = strdup("Error");
|
||||||
|
break;
|
||||||
|
case ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS:
|
||||||
|
*state_str = strdup("Error: Migration in progress");
|
||||||
|
break;
|
||||||
|
default:
|
||||||
|
asprintf(state_str, "Unknown %d", state);
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
#endif
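The state-to-string helper above hands a freshly allocated string back through an output pointer, so the caller is responsible for freeing it. A minimal standalone sketch of that contract follows. The `DEMO_*` constants use a made-up base value (the real one comes from the SnapC headers), only three states are covered, and `snprintf` plus a small copy helper stand in for `strdup`/`asprintf` to stay within ISO C.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: the real base value is defined elsewhere. */
#define DEMO_CKPT_MAX              20
#define DEMO_MIGRATE_STATE_ERROR   (DEMO_CKPT_MAX + 1)
#define DEMO_MIGRATE_STATE_NONE    (DEMO_CKPT_MAX + 3)
#define DEMO_MIGRATE_STATE_RUNNING (DEMO_CKPT_MAX + 5)

/* Copy a string onto the heap; returns NULL on allocation failure. */
static char *demo_dup(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (NULL != copy) {
        memcpy(copy, s, n);
    }
    return copy;
}

/* Returns 0 on success, -1 if memory could not be allocated.
 * On success *state_str is heap-allocated and owned by the caller. */
static int demo_migrate_state_str(char **state_str, int state)
{
    char buf[32];

    assert(NULL != state_str);
    switch (state) {
    case DEMO_MIGRATE_STATE_NONE:
        *state_str = demo_dup(" -- ");
        break;
    case DEMO_MIGRATE_STATE_RUNNING:
        *state_str = demo_dup("Running");
        break;
    case DEMO_MIGRATE_STATE_ERROR:
        *state_str = demo_dup("Error");
        break;
    default:
        /* Unknown values still produce a printable label. */
        snprintf(buf, sizeof(buf), "Unknown %d", state);
        *state_str = demo_dup(buf);
        break;
    }
    return (NULL == *state_str) ? -1 : 0;
}
```

Unlike the helper in the diff, this sketch reports allocation failure to the caller; that is a design choice of the sketch, not a statement about the original.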
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_FT_CR
|
||||||
|
int orte_errmgr_base_update_app_context_for_cr_recovery(orte_job_t *jobdata,
|
||||||
|
orte_proc_t *proc,
|
||||||
|
opal_list_t *local_snapshots)
|
||||||
|
{
|
||||||
|
int ret, exit_status = ORTE_SUCCESS;
|
||||||
|
opal_list_item_t *item = NULL;
|
||||||
|
orte_std_cntr_t i_app;
|
||||||
|
int argc = 0;
|
||||||
|
orte_app_context_t *cur_app_context = NULL;
|
||||||
|
orte_app_context_t *new_app_context = NULL;
|
||||||
|
orte_sstore_base_local_snapshot_info_t *vpid_snapshot = NULL;
|
||||||
|
char *reference_fmt_str = NULL;
|
||||||
|
char *location_str = NULL;
|
||||||
|
char *cache_location_str = NULL;
|
||||||
|
char *ref_location_fmt_str = NULL;
|
||||||
|
char *tmp_str = NULL;
|
||||||
|
char *global_snapshot_ref = NULL;
|
||||||
|
char *global_snapshot_seq = NULL;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Get the snapshot restart command for this process
|
||||||
|
* JJH CLEANUP: Pass in the vpid_snapshot, so we don't have to look it up every time?
|
||||||
|
*/
|
||||||
|
for(item = opal_list_get_first(local_snapshots);
|
||||||
|
item != opal_list_get_end(local_snapshots);
|
||||||
|
item = opal_list_get_next(item) ) {
|
||||||
|
vpid_snapshot = (orte_sstore_base_local_snapshot_info_t*)item;
|
||||||
|
if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL,
|
||||||
|
&vpid_snapshot->process_name,
|
||||||
|
&proc->name) ) {
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
vpid_snapshot = NULL;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if( NULL == vpid_snapshot ) {
|
||||||
|
ORTE_ERROR_LOG(ORTE_ERROR);
|
||||||
|
exit_status = ORTE_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||||
|
SSTORE_METADATA_LOCAL_SNAP_REF_FMT,
|
||||||
|
&reference_fmt_str);
|
||||||
|
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||||
|
SSTORE_METADATA_LOCAL_SNAP_LOC,
|
||||||
|
&location_str);
|
||||||
|
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||||
|
SSTORE_METADATA_LOCAL_SNAP_REF_LOC_FMT,
|
||||||
|
&ref_location_fmt_str);
|
||||||
|
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||||
|
SSTORE_METADATA_GLOBAL_SNAP_REF,
|
||||||
|
&global_snapshot_ref);
|
||||||
|
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||||
|
SSTORE_METADATA_GLOBAL_SNAP_SEQ,
|
||||||
|
&global_snapshot_seq);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Find current app_context
|
||||||
|
*/
|
||||||
|
cur_app_context = NULL;
|
||||||
|
for(i_app = 0; i_app < opal_pointer_array_get_size(jobdata->apps); ++i_app) {
|
||||||
|
cur_app_context = (orte_app_context_t *)opal_pointer_array_get_item(jobdata->apps,
|
||||||
|
i_app);
|
||||||
|
if( NULL == cur_app_context ) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
if(proc->app_idx == cur_app_context->idx) {
|
||||||
|
break;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if( NULL == cur_app_context ) {
|
||||||
|
ORTE_ERROR_LOG(ORTE_ERROR);
|
||||||
|
exit_status = ORTE_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* if > 1 processes in this app context
|
||||||
|
* Create a new app_context
|
||||||
|
* Copy over attributes
|
||||||
|
* Add it to the job_t data structure
|
||||||
|
* Associate it with this process in the job
|
||||||
|
* else
|
||||||
|
* Reuse this app_context
|
||||||
|
*/
|
||||||
|
if( cur_app_context->num_procs > 1 ) {
|
||||||
|
/* Create a new app_context */
|
||||||
|
new_app_context = OBJ_NEW(orte_app_context_t);
|
||||||
|
|
||||||
|
/* Copy over attributes */
|
||||||
|
new_app_context->idx = cur_app_context->idx;
|
||||||
|
new_app_context->app = NULL; /* strdup(cur_app_context->app); */
|
||||||
|
new_app_context->num_procs = 1;
|
||||||
|
new_app_context->argv = NULL; /* opal_argv_copy(cur_app_context->argv); */
|
||||||
|
new_app_context->env = opal_argv_copy(cur_app_context->env);
|
||||||
|
new_app_context->cwd = (NULL == cur_app_context->cwd ? NULL :
|
||||||
|
strdup(cur_app_context->cwd));
|
||||||
|
new_app_context->user_specified_cwd = cur_app_context->user_specified_cwd;
|
||||||
|
new_app_context->hostfile = (NULL == cur_app_context->hostfile ? NULL :
|
||||||
|
strdup(cur_app_context->hostfile));
|
||||||
|
new_app_context->add_hostfile = (NULL == cur_app_context->add_hostfile ? NULL :
|
||||||
|
strdup(cur_app_context->add_hostfile));
|
||||||
|
new_app_context->dash_host = opal_argv_copy(cur_app_context->dash_host);
|
||||||
|
new_app_context->prefix_dir = (NULL == cur_app_context->prefix_dir ? NULL :
|
||||||
|
strdup(cur_app_context->prefix_dir));
|
||||||
|
new_app_context->preload_binary = false;
|
||||||
|
new_app_context->preload_libs = false;
|
||||||
|
new_app_context->preload_files_dest_dir = NULL;
|
||||||
|
new_app_context->preload_files_src_dir = NULL;
|
||||||
|
|
||||||
|
asprintf(&tmp_str, reference_fmt_str, vpid_snapshot->process_name.vpid);
|
||||||
|
asprintf(&(new_app_context->sstore_load),
|
||||||
|
"%s:%s:%s:%s:%s:%s",
|
||||||
|
location_str,
|
||||||
|
global_snapshot_ref,
|
||||||
|
tmp_str,
|
||||||
|
(vpid_snapshot->compress_comp == NULL ? "" : vpid_snapshot->compress_comp),
|
||||||
|
(vpid_snapshot->compress_postfix == NULL ? "" : vpid_snapshot->compress_postfix),
|
||||||
|
global_snapshot_seq);
|
||||||
|
|
||||||
|
new_app_context->used_on_node = cur_app_context->used_on_node;
|
||||||
|
|
||||||
|
/* Add it to the job_t data structure */
|
||||||
|
/*current_global_jobdata->num_apps++; */
|
||||||
|
new_app_context->idx = (jobdata->num_apps);
|
||||||
|
proc->app_idx = new_app_context->idx;
|
||||||
|
|
||||||
|
opal_pointer_array_add(jobdata->apps, new_app_context);
|
||||||
|
++(jobdata->num_apps);
|
||||||
|
|
||||||
|
/* Remove association with the old app_context */
|
||||||
|
--(cur_app_context->num_procs);
|
||||||
|
}
|
||||||
|
else {
|
||||||
|
new_app_context = cur_app_context;
|
||||||
|
|
||||||
|
/* Cleanout old stuff */
|
||||||
|
free(new_app_context->app);
|
||||||
|
new_app_context->app = NULL;
|
||||||
|
|
||||||
|
opal_argv_free(new_app_context->argv);
|
||||||
|
new_app_context->argv = NULL;
|
||||||
|
|
||||||
|
asprintf(&tmp_str, reference_fmt_str, vpid_snapshot->process_name.vpid);
|
||||||
|
asprintf(&(new_app_context->sstore_load),
|
||||||
|
"%s:%s:%s:%s:%s:%s",
|
||||||
|
location_str,
|
||||||
|
global_snapshot_ref,
|
||||||
|
tmp_str,
|
||||||
|
(vpid_snapshot->compress_comp == NULL ? "" : vpid_snapshot->compress_comp),
|
||||||
|
(vpid_snapshot->compress_postfix == NULL ? "" : vpid_snapshot->compress_postfix),
|
||||||
|
global_snapshot_seq);
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Update the app_context with the restart information
|
||||||
|
*/
|
||||||
|
new_app_context->app = strdup("opal-restart");
|
||||||
|
opal_argv_append(&argc, &(new_app_context->argv), new_app_context->app);
|
||||||
|
opal_argv_append(&argc, &(new_app_context->argv), "-l");
|
||||||
|
opal_argv_append(&argc, &(new_app_context->argv), location_str);
|
||||||
|
opal_argv_append(&argc, &(new_app_context->argv), "-m");
|
||||||
|
opal_argv_append(&argc, &(new_app_context->argv), orte_sstore_base_local_metadata_filename);
|
||||||
|
opal_argv_append(&argc, &(new_app_context->argv), "-r");
|
||||||
|
if( NULL != tmp_str ) {
|
||||||
|
free(tmp_str);
|
||||||
|
tmp_str = NULL;
|
||||||
|
}
|
||||||
|
asprintf(&tmp_str, reference_fmt_str, vpid_snapshot->process_name.vpid);
|
||||||
|
opal_argv_append(&argc, &(new_app_context->argv), tmp_str);
|
||||||
|
|
||||||
|
cleanup:
|
||||||
|
if( NULL != tmp_str) {
|
||||||
|
free(tmp_str);
|
||||||
|
tmp_str = NULL;
|
||||||
|
}
|
||||||
|
if( NULL != location_str ) {
|
||||||
|
free(location_str);
|
||||||
|
location_str = NULL;
|
||||||
|
}
|
||||||
|
if( NULL != cache_location_str ) {
|
||||||
|
free(cache_location_str);
|
||||||
|
cache_location_str = NULL;
|
||||||
|
}
|
||||||
|
if( NULL != reference_fmt_str ) {
|
||||||
|
free(reference_fmt_str);
|
||||||
|
reference_fmt_str = NULL;
|
||||||
|
}
|
||||||
|
if( NULL != ref_location_fmt_str ) {
|
||||||
|
free(ref_location_fmt_str);
|
||||||
|
ref_location_fmt_str = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
return exit_status;
|
||||||
|
}
|
||||||
|
#endif
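The recovery helper above packs six fields into one colon-separated `sstore_load` string and substitutes an empty string when a compression field is NULL. The following standalone sketch shows that formatting step. `demo_build_sstore_load` is a hypothetical helper, and the field order is taken from the format string in the diff (location, global reference, local reference, compression component, compression postfix, sequence number).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Map NULL to "" so optional fields still occupy their slot. */
static const char *demo_or_empty(const char *s)
{
    return (NULL == s) ? "" : s;
}

/* Build "loc:global_ref:local_ref:comp:postfix:seq".
 * comp and postfix may be NULL. Returns a heap string owned by the
 * caller, or NULL on allocation failure. */
static char *demo_build_sstore_load(const char *loc,
                                    const char *global_ref,
                                    const char *local_ref,
                                    const char *comp,
                                    const char *postfix,
                                    const char *seq)
{
    size_t len;
    char *out;

    assert(NULL != loc && NULL != global_ref);
    assert(NULL != local_ref && NULL != seq);
    comp = demo_or_empty(comp);
    postfix = demo_or_empty(postfix);

    /* Six fields, five ':' separators, one terminating NUL. */
    len = strlen(loc) + strlen(global_ref) + strlen(local_ref)
        + strlen(comp) + strlen(postfix) + strlen(seq) + 5 + 1;
    out = malloc(len);
    if (NULL == out) {
        return NULL;
    }
    snprintf(out, len, "%s:%s:%s:%s:%s:%s",
             loc, global_ref, local_ref, comp, postfix, seq);
    return out;
}
```

Keeping the empty slots means a reader of the string can always split on `:` and find each field at a fixed position, whether or not compression was used.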
|
||||||
|
|
||||||
|
#if OPAL_ENABLE_FT_CR
|
||||||
|
int orte_errmgr_base_restart_job(orte_jobid_t jobid, char * global_handle, int seq_num)
|
||||||
|
{
|
||||||
|
int ret, exit_status = ORTE_SUCCESS;
|
||||||
|
orte_process_name_t loc_proc;
|
||||||
|
orte_sstore_base_handle_t prev_sstore_handle = ORTE_SSTORE_HANDLE_INVALID;
|
||||||
|
|
||||||
|
/* JJH First determine if we can recover this way */
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Find the corresponding sstore handle
|
||||||
|
*/
|
||||||
|
prev_sstore_handle = orte_sstore_handle_last_stable;
|
||||||
|
if( ORTE_SUCCESS != (ret = orte_sstore.request_restart_handle(&orte_sstore_handle_last_stable,
|
||||||
|
NULL,
|
||||||
|
global_handle,
|
||||||
|
seq_num,
|
||||||
|
NULL)) ) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Start the recovery
|
||||||
|
*/
|
||||||
|
orte_snapc_base_has_recovered = false;
|
||||||
|
loc_proc.jobid = jobid;
|
||||||
|
loc_proc.vpid = 0;
|
||||||
|
orte_errmgr_base_update_state(jobid, ORTE_JOB_STATE_RESTART,
|
||||||
|
&loc_proc, ORTE_PROC_STATE_KILLED_BY_CMD,
|
||||||
|
0, 0);
|
||||||
|
while( !orte_snapc_base_has_recovered ) {
|
||||||
|
opal_progress();
|
||||||
|
}
|
||||||
|
orte_sstore_handle_last_stable = prev_sstore_handle;
|
||||||
|
|
||||||
|
cleanup:
|
||||||
|
return exit_status;
|
||||||
|
}
|
||||||
|
|
||||||
|
int orte_errmgr_base_migrate_job(orte_jobid_t jobid, orte_snapc_base_request_op_t *datum)
|
||||||
|
{
|
||||||
|
int ret, exit_status = ORTE_SUCCESS;
|
||||||
|
int i;
|
||||||
|
opal_list_t *proc_list = NULL;
|
||||||
|
opal_list_t *node_list = NULL;
|
||||||
|
opal_list_t *suggested_map_list = NULL;
|
||||||
|
orte_errmgr_predicted_map_t *onto_map = NULL;
|
||||||
|
#if 0
|
||||||
|
orte_errmgr_predicted_proc_t *off_proc = NULL;
|
||||||
|
orte_errmgr_predicted_node_t *off_node = NULL;
|
||||||
|
#endif
|
||||||
|
|
||||||
|
proc_list = OBJ_NEW(opal_list_t);
|
||||||
|
node_list = OBJ_NEW(opal_list_t);
|
||||||
|
suggested_map_list = OBJ_NEW(opal_list_t);
|
||||||
|
|
||||||
|
for( i = 0; i < datum->mig_num; ++i ) {
|
||||||
|
/*
|
||||||
|
* List all processes that are included in the migration.
|
||||||
|
* We will sort them out in the component.
|
||||||
|
*/
|
||||||
|
onto_map = OBJ_NEW(orte_errmgr_predicted_map_t);
|
||||||
|
|
||||||
|
if( (datum->mig_off_node)[i] ) {
|
||||||
|
onto_map->off_current_node = true;
|
||||||
|
} else {
|
||||||
|
onto_map->off_current_node = false;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Who to migrate */
|
||||||
|
onto_map->proc_name.jobid = jobid;
|
||||||
|
onto_map->proc_name.vpid = (datum->mig_vpids)[i];
|
||||||
|
|
||||||
|
/* Destination */
|
||||||
|
onto_map->map_proc_name.jobid = jobid;
|
||||||
|
onto_map->map_proc_name.vpid = (datum->mig_vpid_pref)[i];
|
||||||
|
|
||||||
|
if( ((datum->mig_host_pref)[i])[0] == '\0') {
|
||||||
|
onto_map->map_node_name = NULL;
|
||||||
|
} else {
|
||||||
|
onto_map->map_node_name = strdup((datum->mig_host_pref)[i]);
|
||||||
|
}
|
||||||
|
|
||||||
|
opal_list_append(suggested_map_list, &(onto_map->super));
|
||||||
|
}
|
||||||
|
|
||||||
|
if( ORTE_SUCCESS != (ret = orte_errmgr_base_predicted_fault(proc_list, node_list, suggested_map_list)) ) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
exit_status = ret;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
cleanup:
|
||||||
|
return exit_status;
|
||||||
|
}
|
||||||
|
|
||||||
|
#endif
|
||||||
|
|
||||||
|
/********************
|
||||||
|
* Local Functions
|
||||||
|
********************/
|
||||||
|
477
orte/mca/errmgr/base/errmgr_base_tool.c
Normal file
@ -0,0 +1,477 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||||
|
* All rights reserved.
|
||||||
|
*
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include "orte_config.h"
|
||||||
|
|
||||||
|
#ifdef HAVE_STRING_H
|
||||||
|
#include <string.h>
|
||||||
|
#endif
|
||||||
|
#if HAVE_SYS_TYPES_H
|
||||||
|
#include <sys/types.h>
|
||||||
|
#endif /* HAVE_SYS_TYPES_H */
|
||||||
|
#ifdef HAVE_UNISTD_H
|
||||||
|
#include <unistd.h>
|
||||||
|
#endif /* HAVE_UNISTD_H */
|
||||||
|
#if HAVE_SYS_TYPES_H
|
||||||
|
#include <sys/types.h>
|
||||||
|
#endif /* HAVE_SYS_TYPES_H */
|
||||||
|
#if HAVE_SYS_STAT_H
|
||||||
|
#include <sys/stat.h>
|
||||||
|
#endif /* HAVE_SYS_STAT_H */
|
||||||
|
#ifdef HAVE_DIRENT_H
|
||||||
|
#include <dirent.h>
|
||||||
|
#endif /* HAVE_DIRENT_H */
|
||||||
|
#include <time.h>
|
||||||
|
|
||||||
|
#include "opal/mca/mca.h"
|
||||||
|
#include "opal/mca/base/base.h"
|
||||||
|
|
||||||
|
#include "opal/mca/base/mca_base_param.h"
|
||||||
|
#include "opal/util/os_dirpath.h"
|
||||||
|
#include "opal/util/output.h"
|
||||||
|
#include "opal/util/basename.h"
|
||||||
|
#include "opal/util/argv.h"
|
||||||
|
#include "opal/mca/crs/crs.h"
|
||||||
|
#include "opal/mca/crs/base/base.h"
|
||||||
|
|
||||||
|
#include "orte/mca/rml/rml.h"
|
||||||
|
#include "orte/mca/rml/rml_types.h"
|
||||||
|
#include "orte/mca/snapc/snapc.h"
|
||||||
|
#include "orte/runtime/orte_globals.h"
|
||||||
|
#include "orte/util/name_fns.h"
|
||||||
|
|
||||||
|
#include "orte/mca/errmgr/errmgr.h"
|
||||||
|
#include "orte/mca/errmgr/base/base.h"
|
||||||
|
#include "orte/mca/errmgr/base/errmgr_private.h"
|
||||||
|
|
||||||
|
/**
|
||||||
|
* This file contains functions for the HNP to communicate with the
|
||||||
|
* orte-migrate command.
|
||||||
|
*/
|
||||||
|
#if OPAL_ENABLE_FT_CR
|
||||||
|
|
||||||
|
/******************
|
||||||
|
* Local Functions
|
||||||
|
******************/
|
||||||
|
static int errmgr_base_tool_start_cmdline_listener(void);
|
||||||
|
static int errmgr_base_tool_stop_cmdline_listener(void);
|
||||||
|
|
||||||
|
static void errmgr_base_tool_cmdline_recv(int status,
|
||||||
|
orte_process_name_t* sender,
|
||||||
|
opal_buffer_t* buffer,
|
||||||
|
orte_rml_tag_t tag,
|
||||||
|
void* cbdata);
|
||||||
|
static void errmgr_base_tool_cmdline_process_recv(int fd,
|
||||||
|
short event,
|
||||||
|
void *cbdata);
|
||||||
|
|
||||||
|
|
||||||
|
/******************
|
||||||
|
* Object stuff
|
||||||
|
******************/
|
||||||
|
static orte_process_name_t errmgr_cmdline_sender = {ORTE_JOBID_INVALID, ORTE_VPID_INVALID};
|
||||||
|
static bool errmgr_cmdline_recv_issued = false;
|
||||||
|
static int errmgr_tool_initialized = 0; /* init/finalize reference count */
|
||||||
|
|
||||||
|
/********************
|
||||||
|
* Module Functions
|
||||||
|
********************/
|
||||||
|
int orte_errmgr_base_tool_init(void)
|
||||||
|
{
|
||||||
|
int ret;
|
||||||
|
|
||||||
|
if( (++errmgr_tool_initialized) != 1 ) {
|
||||||
|
if( errmgr_tool_initialized < 1 ) {
|
||||||
|
return OPAL_ERROR;
|
||||||
|
}
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Only HNP communicates with tools */
|
||||||
|
if (! ORTE_PROC_IS_HNP) {
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Setup command line migrate tool request listener
|
||||||
|
*/
|
||||||
|
if( ORTE_SUCCESS != (ret = errmgr_base_tool_start_cmdline_listener()) ) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
return ret;
|
||||||
|
}
|
||||||
|
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int orte_errmgr_base_tool_finalize(void)
|
||||||
|
{
|
||||||
|
int ret;
|
||||||
|
|
||||||
|
if( (--errmgr_tool_initialized) != 0 ) {
|
||||||
|
if( errmgr_tool_initialized < 0 ) {
|
||||||
|
return OPAL_ERROR;
|
||||||
|
}
|
||||||
|
return OPAL_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Only HNP communicates with tools */
|
||||||
|
if (! ORTE_PROC_IS_HNP) {
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Clean up listeners
|
||||||
|
*/
|
||||||
|
if( ORTE_SUCCESS != (ret = errmgr_base_tool_stop_cmdline_listener()) ) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
return ret;
|
||||||
|
}
|
||||||
|
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
int orte_errmgr_base_migrate_update(int status)
|
||||||
|
{
|
||||||
|
int ret, exit_status = ORTE_SUCCESS;
|
||||||
|
opal_buffer_t *loc_buffer = NULL;
|
||||||
|
orte_errmgr_tool_cmd_flag_t command = ORTE_ERRMGR_MIGRATE_TOOL_UPDATE_CMD;
|
||||||
|
|
||||||
|
/* Only HNP communicates with tools */
|
||||||
|
if (! ORTE_PROC_IS_HNP) {
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* If this is an invalid state, then return an error
|
||||||
|
*/
|
||||||
|
if( ORTE_ERRMGR_MIGRATE_MAX < status ) {
|
||||||
|
opal_output(orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:update() Error: Invalid state %d > (Max %d)",
|
||||||
|
status, ORTE_ERRMGR_MIGRATE_MAX);
|
||||||
|
return ORTE_ERR_BAD_PARAM;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* If the caller is indicating that they are finished and ready for another
|
||||||
|
* command, then repost the RML listener.
|
||||||
|
*/
|
||||||
|
if( ORTE_ERRMGR_MIGRATE_STATE_NONE == status ) {
|
||||||
|
if( ORTE_SUCCESS != (ret = errmgr_base_tool_start_cmdline_listener()) ) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
return ret;
|
||||||
|
}
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Noop if invalid peer, or peer not specified
|
||||||
|
*/
|
||||||
|
if( OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_NAME_INVALID, &errmgr_cmdline_sender) ) {
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Do not send to self, as that is silly.
|
||||||
|
*/
|
||||||
|
if( OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_PROC_MY_HNP, &errmgr_cmdline_sender) ) {
|
||||||
|
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:update() Warning: Do not send to self!\n"));
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:update() Sending update command <status %d>\n",
|
||||||
|
status));
|
||||||
|
|
||||||
|
/********************
|
||||||
|
* Send over the status of the migration
|
||||||
|
* - migration state
|
||||||
|
********************/
|
||||||
|
if (NULL == (loc_buffer = OBJ_NEW(opal_buffer_t))) {
|
||||||
|
exit_status = ORTE_ERROR;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (ORTE_SUCCESS != (ret = opal_dss.pack(loc_buffer, &command, 1, ORTE_ERRMGR_MIGRATE_TOOL_CMD)) ) {
|
||||||
|
opal_output(orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:update() Error: DSS Pack (cmd) Failure (ret = %d)\n",
|
||||||
|
ret);
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
exit_status = ret;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (ORTE_SUCCESS != (ret = opal_dss.pack(loc_buffer, &status, 1, OPAL_INT))) {
|
||||||
|
opal_output(orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:update() Error: DSS Pack (status) Failure (ret = %d)\n",
|
||||||
|
ret);
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
exit_status = ret;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (0 > (ret = orte_rml.send_buffer(&errmgr_cmdline_sender, loc_buffer, ORTE_RML_TAG_MIGRATE, 0))) {
|
||||||
|
opal_output(orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:update() Error: Send (status) Failure (ret = %d)\n",
|
||||||
|
ret);
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
exit_status = ret;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
cleanup:
|
||||||
|
if(NULL != loc_buffer) {
|
||||||
|
OBJ_RELEASE(loc_buffer);
|
||||||
|
loc_buffer = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
return exit_status;
|
||||||
|
}
|
||||||
|
|
||||||
|
/********************
|
||||||
|
* Utility functions
|
||||||
|
********************/
|
||||||
|
|
||||||
|
/********************
|
||||||
|
* Local Functions
|
||||||
|
********************/
|
||||||
|
static int errmgr_base_tool_start_cmdline_listener(void)
|
||||||
|
{
|
||||||
|
int ret, exit_status = ORTE_SUCCESS;
|
||||||
|
|
||||||
|
if (errmgr_cmdline_recv_issued && ORTE_PROC_IS_HNP) {
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool: Startup Command Line Channel"));
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Coordinator command listener
|
||||||
|
*/
|
||||||
|
errmgr_cmdline_sender.jobid = ORTE_JOBID_INVALID;
|
||||||
|
errmgr_cmdline_sender.vpid = ORTE_VPID_INVALID;
|
||||||
|
if (ORTE_SUCCESS != (ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
|
||||||
|
ORTE_RML_TAG_MIGRATE,
|
||||||
|
0,
|
||||||
|
errmgr_base_tool_cmdline_recv,
|
||||||
|
NULL))) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
exit_status = ret;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
errmgr_cmdline_recv_issued = true;
|
||||||
|
|
||||||
|
cleanup:
|
||||||
|
return exit_status;
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
static int errmgr_base_tool_stop_cmdline_listener(void)
|
||||||
|
{
|
||||||
|
int ret, exit_status = ORTE_SUCCESS;
|
||||||
|
|
||||||
|
if (!errmgr_cmdline_recv_issued && ORTE_PROC_IS_HNP) {
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool: Shutdown Command Line Channel"));
|
||||||
|
|
||||||
|
if (ORTE_SUCCESS != (ret = orte_rml.recv_cancel(ORTE_NAME_WILDCARD,
|
||||||
|
ORTE_RML_TAG_MIGRATE))) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
exit_status = ret;
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
errmgr_cmdline_recv_issued = false;
|
||||||
|
|
||||||
|
cleanup:
|
||||||
|
return exit_status;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*****************
|
||||||
|
* Listener Callbacks
|
||||||
|
*****************/
|
||||||
|
static void errmgr_base_tool_cmdline_recv(int status,
|
||||||
|
orte_process_name_t* sender,
|
||||||
|
opal_buffer_t* buffer,
|
||||||
|
orte_rml_tag_t tag,
|
||||||
|
void* cbdata)
|
||||||
|
{
|
||||||
|
if( ORTE_RML_TAG_MIGRATE != tag ) {
|
||||||
|
opal_output(orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:recv() Error: Unknown tag: Received a command message from %s (tag = %d).",
|
||||||
|
ORTE_NAME_PRINT(sender), tag);
|
||||||
|
ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:recv() Command Line: Start a migration operation [Sender = %s]",
|
||||||
|
ORTE_NAME_PRINT(sender)));
|
||||||
|
|
||||||
|
errmgr_cmdline_recv_issued = false; /* Not a persistent RML message */
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Do not process this right away - we need to get out of the recv before
|
||||||
|
* we process the message to avoid performing the rest of the job while
|
||||||
|
* inside this receive! Instead, setup an event so that the message gets processed
|
||||||
|
* as soon as we leave the recv.
|
||||||
|
*
|
||||||
|
* The macro makes a copy of the buffer, which we release above - the incoming
|
||||||
|
* buffer, however, is NOT released here, although its payload IS transferred
|
||||||
|
* to the message buffer for later processing
|
||||||
|
*
|
||||||
|
*/
|
||||||
|
ORTE_MESSAGE_EVENT(sender, buffer, tag, errmgr_base_tool_cmdline_process_recv);
|
||||||
|
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
static void errmgr_base_tool_cmdline_process_recv(int fd, short event, void *cbdata)
|
||||||
|
{
|
||||||
|
int ret;
|
||||||
|
orte_message_event_t *mev = (orte_message_event_t*)cbdata;
|
||||||
|
orte_process_name_t *sender = NULL, swap_dest;
|
||||||
|
orte_errmgr_tool_cmd_flag_t command;
|
||||||
|
orte_std_cntr_t count = 1;
|
||||||
|
char *off_nodes = NULL;
|
||||||
|
char *off_procs = NULL;
|
||||||
|
char *onto_nodes = NULL;
|
||||||
|
char **split_off_nodes = NULL;
|
||||||
|
char **split_off_procs = NULL;
|
||||||
|
char **split_onto_nodes = NULL;
|
||||||
|
opal_list_t *proc_list = NULL;
|
||||||
|
opal_list_t *node_list = NULL;
|
||||||
|
opal_list_t *suggested_map_list = NULL;
|
||||||
|
orte_errmgr_predicted_proc_t *off_proc = NULL;
|
||||||
|
orte_errmgr_predicted_node_t *off_node = NULL;
|
||||||
|
orte_errmgr_predicted_map_t *onto_map = NULL;
|
||||||
|
int cnt = 0, i;
|
||||||
|
|
||||||
|
sender = &(mev->sender);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* If we are already interacting with a command line tool then reject this
|
||||||
|
* request. Since we only allow the processing of one tool command at a
|
||||||
|
* time.
|
||||||
|
*/
|
||||||
|
if( OPAL_EQUAL != orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_NAME_INVALID, &errmgr_cmdline_sender) ) {
|
||||||
|
swap_dest.jobid = errmgr_cmdline_sender.jobid;
|
||||||
|
swap_dest.vpid = errmgr_cmdline_sender.vpid;
|
||||||
|
|
||||||
|
errmgr_cmdline_sender = *sender;
|
||||||
|
orte_errmgr_base_migrate_update(ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS);
|
||||||
|
|
||||||
|
errmgr_cmdline_sender.jobid = swap_dest.jobid;
|
||||||
|
errmgr_cmdline_sender.vpid = swap_dest.vpid;
|
||||||
|
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
errmgr_cmdline_sender = *sender;
|
||||||
|
|
||||||
|
count = 1;
|
||||||
|
if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &command, &count, ORTE_ERRMGR_MIGRATE_TOOL_CMD))) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* orte-migrate has requested a process migration
|
||||||
|
*/
|
||||||
|
if (ORTE_ERRMGR_MIGRATE_TOOL_INIT_CMD == command) {
|
||||||
|
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:recv() Command line requested process migration [command %d]\n",
|
||||||
|
command));
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Unpack the buffer from the orte-migrate command
|
||||||
|
*/
|
||||||
|
count = 1;
|
||||||
|
if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &(off_procs), &count, OPAL_STRING))) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &(off_nodes), &count, OPAL_STRING))) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &(onto_nodes), &count, OPAL_STRING))) {
|
||||||
|
ORTE_ERROR_LOG(ret);
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Parse the comma separated list
|
||||||
|
*/
|
||||||
|
proc_list = OBJ_NEW(opal_list_t);
|
||||||
|
node_list = OBJ_NEW(opal_list_t);
|
||||||
|
suggested_map_list = OBJ_NEW(opal_list_t);
|
||||||
|
|
||||||
|
split_off_procs = opal_argv_split(off_procs, ',');
|
||||||
|
cnt = opal_argv_count(split_off_procs);
|
||||||
|
if( cnt > 0 ) {
|
||||||
|
for(i = 0; i < cnt; ++i) {
|
||||||
|
off_proc = OBJ_NEW(orte_errmgr_predicted_proc_t);
|
||||||
|
off_proc->proc_name.vpid = atoi(split_off_procs[i]);
|
||||||
|
opal_list_append(proc_list, &(off_proc->super));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
split_off_nodes = opal_argv_split(off_nodes, ',');
|
||||||
|
cnt = opal_argv_count(split_off_nodes);
|
||||||
|
if( cnt > 0 ) {
|
||||||
|
for(i = 0; i < cnt; ++i) {
|
||||||
|
off_node = OBJ_NEW(orte_errmgr_predicted_node_t);
|
||||||
|
off_node->node_name = strdup(split_off_nodes[i]);
|
||||||
|
opal_list_append(node_list, &(off_node->super));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
split_onto_nodes = opal_argv_split(onto_nodes, ',');
|
||||||
|
cnt = opal_argv_count(split_onto_nodes);
|
||||||
|
if( cnt > 0 ) {
|
||||||
|
for(i = 0; i < cnt; ++i) {
|
||||||
|
onto_map = OBJ_NEW(orte_errmgr_predicted_map_t);
|
||||||
|
onto_map->map_node_name = strdup(split_onto_nodes[i]);
|
||||||
|
opal_list_append(suggested_map_list, &(onto_map->super));
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Pass to the predicted fault function to see how they would like to progress
|
||||||
|
*/
|
||||||
|
orte_errmgr_base_predicted_fault(proc_list, node_list, suggested_map_list);
|
||||||
|
}
|
||||||
|
/*
|
||||||
|
* Unknown command
|
||||||
|
*/
|
||||||
|
else {
|
||||||
|
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||||
|
"errmgr:base:tool:recv() Command line sent an unknown command (command %d)\n",
|
||||||
|
command));
|
||||||
|
ORTE_ERROR_LOG(ORTE_ERR_NOT_SUPPORTED);
|
||||||
|
goto cleanup;
|
||||||
|
}
|
||||||
|
|
||||||
|
cleanup:
|
||||||
|
/* release the message event */
|
||||||
|
OBJ_RELEASE(mev);
|
||||||
|
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
#endif
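
The tool handler above splits the comma-separated process list with `opal_argv_split()` and converts each entry with `atoi()`, which cannot report malformed input. The following is a minimal standalone sketch of stricter parsing, not code from this commit: `parse_rank_list` is a hypothetical helper that uses only the C standard library, and the rule that an empty string means "no processes" is an assumption made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper (not part of Open MPI): parse a comma-separated list
 * of non-negative ranks such as "0,3,7" into out[].
 * Returns the number of ranks stored, 0 for a NULL or empty string, or -1
 * if the text is malformed or holds more than max entries. */
int parse_rank_list(const char *csv, unsigned long *out, int max)
{
    int n = 0;
    const char *p = csv;

    assert(NULL != out || 0 == max);

    if (NULL == csv || '\0' == *csv) {
        return 0;   /* assumption: an empty list is valid */
    }

    for (;;) {
        char *end = NULL;
        unsigned long val;

        /* Every entry must start with a digit: rejects "", "-1", " 2", "x" */
        if (!isdigit((unsigned char)*p) || n >= max) {
            return -1;
        }
        errno = 0;
        val = strtoul(p, &end, 10);
        if (0 != errno) {
            return -1;          /* value out of range */
        }
        out[n++] = val;

        if ('\0' == *end) {
            return n;           /* consumed the whole string */
        }
        if (',' != *end) {
            return -1;          /* trailing garbage after the number */
        }
        p = end + 1;            /* step past the comma */
    }
}
```

A caller would pass the unpacked `off_procs` string and reject the request when the helper returns -1, rather than silently treating a bad entry as rank 0.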
|
@ -72,9 +72,10 @@ ORTE_DECLSPEC int orte_errmgr_base_abort(int error_code, char *fmt, ...)
|
|||||||
__opal_attribute_format__(__printf__, 2, 3)
|
__opal_attribute_format__(__printf__, 2, 3)
|
||||||
# endif
|
# endif
|
||||||
;
|
;
|
||||||
ORTE_DECLSPEC int orte_errmgr_base_predicted_fault(char ***proc_list,
|
|
||||||
char ***node_list,
|
ORTE_DECLSPEC int orte_errmgr_base_predicted_fault(opal_list_t *proc_list,
|
||||||
char ***suggested_nodes);
|
opal_list_t *node_list,
|
||||||
|
opal_list_t *suggested_map);
|
||||||
ORTE_DECLSPEC int orte_errmgr_base_suggest_map_targets(orte_proc_t *proc,
|
ORTE_DECLSPEC int orte_errmgr_base_suggest_map_targets(orte_proc_t *proc,
|
||||||
orte_node_t *oldnode,
|
orte_node_t *oldnode,
|
||||||
opal_list_t *node_list);
|
opal_list_t *node_list);
|
||||||
|
38
orte/mca/errmgr/crmig/Makefile.am
Normal file
@ -0,0 +1,38 @@
#
# Copyright (c) 2009-2010 The Trustees of Indiana University.
#                         All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

dist_pkgdata_DATA = help-orte-errmgr-crmig.txt

sources = \
        errmgr_crmig.h \
        errmgr_crmig_component.c \
        errmgr_crmig_module.c

# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).

if OMPI_BUILD_errmgr_crmig_DSO
component_noinst =
component_install = mca_errmgr_crmig.la
else
component_noinst = libmca_errmgr_crmig.la
component_install =
endif

mcacomponentdir = $(pkglibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_errmgr_crmig_la_SOURCES = $(sources)
mca_errmgr_crmig_la_LDFLAGS = -module -avoid-version

noinst_LTLIBRARIES = $(component_noinst)
libmca_errmgr_crmig_la_SOURCES = $(sources)
libmca_errmgr_crmig_la_LDFLAGS = -module -avoid-version
20
orte/mca/errmgr/crmig/configure.m4
Normal file
@ -0,0 +1,20 @@
# -*- shell-script -*-
#
# Copyright (c) 2009-2010 The Trustees of Indiana University.
#                         All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

# MCA_errmgr_crmig_CONFIG([action-if-found], [action-if-not-found])
# -----------------------------------------------------------
AC_DEFUN([MCA_errmgr_crmig_CONFIG],[
    # If we don't want FT, don't compile this component
    AS_IF([test "$opal_want_ft_cr" = "1"],
        [$1],
        [$2])
])dnl
14
orte/mca/errmgr/crmig/configure.params
Normal file
@ -0,0 +1,14 @@
# -*- shell-script -*-
#
# Copyright (c) 2009-2010 The Trustees of Indiana University.
#                         All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

PARAM_INIT_FILE=errmgr_crmig_component.c
PARAM_CONFIG_FILES="Makefile"
93
orte/mca/errmgr/crmig/errmgr_crmig.h
Normal file
@ -0,0 +1,93 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||||
|
* All rights reserved.
|
||||||
|
*
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* @file
|
||||||
|
*
|
||||||
|
* Checkpoint/Restart Process Migration (CRMIG) ErrMgr component
|
||||||
|
*
|
||||||
|
* Simple, braindead implementation.
|
||||||
|
*/
|
||||||
|
|
||||||
|
#ifndef MCA_ERRMGR_CRMIG_EXPORT_H
|
||||||
|
#define MCA_ERRMGR_CRMIG_EXPORT_H
|
||||||
|
|
||||||
|
#include "orte_config.h"
|
||||||
|
|
||||||
|
#include "opal/mca/mca.h"
|
||||||
|
#include "opal/event/event.h"
|
||||||
|
|
||||||
|
#include "orte/mca/filem/filem.h"
|
||||||
|
#include "orte/mca/errmgr/errmgr.h"
|
||||||
|
|
||||||
|
BEGIN_C_DECLS
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Local Component structures
|
||||||
|
*/
|
||||||
|
struct orte_errmgr_crmig_component_t {
|
||||||
|
orte_errmgr_base_component_t super; /** Base Errmgr component */
|
||||||
|
bool crmig_enabled;
|
||||||
|
bool timing_enabled;
|
||||||
|
};
|
||||||
|
typedef struct orte_errmgr_crmig_component_t orte_errmgr_crmig_component_t;
|
||||||
|
OPAL_MODULE_DECLSPEC extern orte_errmgr_crmig_component_t mca_errmgr_crmig_component;
|
||||||
|
|
||||||
|
int orte_errmgr_crmig_component_query(mca_base_module_t **module, int *priority);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Module functions: Global
|
||||||
|
*/
|
||||||
|
int orte_errmgr_crmig_global_module_init(void);
|
||||||
|
int orte_errmgr_crmig_global_module_finalize(void);
|
||||||
|
|
||||||
|
int orte_errmgr_crmig_global_update_state(orte_jobid_t job,
|
||||||
|
orte_job_state_t jobstate,
|
||||||
|
orte_process_name_t *proc_name,
|
||||||
|
orte_proc_state_t state,
|
||||||
|
pid_t pid,
|
||||||
|
orte_exit_code_t exit_code,
|
||||||
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
|
|
||||||
|
int orte_errmgr_crmig_global_predicted_fault(opal_list_t *proc_list,
|
||||||
|
opal_list_t *node_list,
|
||||||
|
opal_list_t *suggested_map,
|
||||||
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
|
int orte_errmgr_crmig_global_process_fault(orte_job_t *jdata,
|
||||||
|
orte_process_name_t *proc_name,
|
||||||
|
orte_proc_state_t state,
|
||||||
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
|
int orte_errmgr_crmig_global_suggest_map_targets(orte_proc_t *proc,
|
||||||
|
orte_node_t *oldnode,
|
||||||
|
opal_list_t *node_list,
|
||||||
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
|
|
||||||
|
int orte_errmgr_crmig_global_ft_event(int state);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Module functions: Local
|
||||||
|
*/
|
||||||
|
int orte_errmgr_crmig_local_module_init(void);
|
||||||
|
int orte_errmgr_crmig_local_module_finalize(void);
|
||||||
|
|
||||||
|
int orte_errmgr_crmig_local_update_state(orte_jobid_t job,
|
||||||
|
orte_job_state_t jobstate,
|
||||||
|
orte_process_name_t *proc_name,
|
||||||
|
orte_proc_state_t state,
|
||||||
|
pid_t pid,
|
||||||
|
orte_exit_code_t exit_code,
|
||||||
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
|
int orte_errmgr_crmig_local_ft_event(int state);
|
||||||
|
|
||||||
|
|
||||||
|
END_C_DECLS
|
||||||
|
|
||||||
|
#endif /* MCA_ERRMGR_CRMIG_EXPORT_H */
|
142
orte/mca/errmgr/crmig/errmgr_crmig_component.c
Normal file
@ -0,0 +1,142 @@
|
|||||||
|
/*
|
||||||
|
* Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||||
|
* All rights reserved.
|
||||||
|
*
|
||||||
|
* $COPYRIGHT$
|
||||||
|
*
|
||||||
|
* Additional copyrights may follow
|
||||||
|
*
|
||||||
|
* $HEADER$
|
||||||
|
*/
|
||||||
|
|
||||||
|
#include "orte_config.h"
|
||||||
|
#include "opal/util/output.h"
|
||||||
|
|
||||||
|
#include "orte/mca/errmgr/errmgr.h"
|
||||||
|
#include "orte/mca/errmgr/base/base.h"
|
||||||
|
#include "orte/mca/errmgr/base/errmgr_private.h"
|
||||||
|
#include "errmgr_crmig.h"
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Public string for version number
|
||||||
|
*/
|
||||||
|
const char *orte_errmgr_crmig_component_version_string =
|
||||||
|
"ORTE ERRMGR crmig MCA component version " ORTE_VERSION;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Local functionality
|
||||||
|
*/
|
||||||
|
static int errmgr_crmig_open(void);
|
||||||
|
static int errmgr_crmig_close(void);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Instantiate the public struct with all of our public information
|
||||||
|
* and pointer to our public functions in it
|
||||||
|
*/
|
||||||
|
orte_errmgr_crmig_component_t mca_errmgr_crmig_component = {
|
||||||
|
/* First do the base component stuff */
|
||||||
|
{
|
||||||
|
/* Handle the general mca_component_t struct containing
|
||||||
|
* meta information about the component itself
|
||||||
|
*/
|
||||||
|
{
|
||||||
|
ORTE_ERRMGR_BASE_VERSION_3_0_0,
|
||||||
|
/* Component name and version */
|
||||||
|
"crmig",
|
||||||
|
ORTE_MAJOR_VERSION,
|
||||||
|
ORTE_MINOR_VERSION,
|
||||||
|
ORTE_RELEASE_VERSION,
|
||||||
|
|
||||||
|
/* Component open and close functions */
|
||||||
|
errmgr_crmig_open,
|
||||||
|
errmgr_crmig_close,
|
||||||
|
orte_errmgr_crmig_component_query
|
||||||
|
},
|
||||||
|
{
|
||||||
|
/* The component is checkpoint ready */
|
||||||
|
MCA_BASE_METADATA_PARAM_CHECKPOINT
|
||||||
|
},
|
||||||
|
|
||||||
|
/* Verbosity level */
|
||||||
|
0,
|
||||||
|
/* opal_output handler */
|
||||||
|
-1,
|
||||||
|
/* Default priority */
|
||||||
|
40
|
||||||
|
}
|
||||||
|
};
|
||||||
|
|
||||||
|
static int errmgr_crmig_open(void)
|
||||||
|
{
|
||||||
|
int val;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* This should be the last component to ever get used since
|
||||||
|
* it doesn't do anything.
|
||||||
|
*/
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version,
|
||||||
|
"priority",
|
||||||
|
"Priority of the ERRMGR crmig component",
|
||||||
|
false, false,
|
||||||
|
mca_errmgr_crmig_component.super.priority,
|
||||||
|
&mca_errmgr_crmig_component.super.priority);
|
||||||
|
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version,
|
||||||
|
"verbose",
|
||||||
|
"Verbose level for the ERRMGR crmig component",
|
||||||
|
false, false,
|
||||||
|
mca_errmgr_crmig_component.super.verbose,
|
||||||
|
&mca_errmgr_crmig_component.super.verbose);
|
||||||
|
/* If there is a custom verbose level for this component then use it,
|
||||||
|
* otherwise take our parent's level and output channel
|
||||||
|
*/
|
||||||
|
if ( 0 != mca_errmgr_crmig_component.super.verbose) {
|
||||||
|
mca_errmgr_crmig_component.super.output_handle = opal_output_open(NULL);
|
||||||
|
opal_output_set_verbosity(mca_errmgr_crmig_component.super.output_handle,
|
||||||
|
mca_errmgr_crmig_component.super.verbose);
|
||||||
|
} else {
|
||||||
|
mca_errmgr_crmig_component.super.output_handle = orte_errmgr_base.output;
|
||||||
|
}
|
||||||
|
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version,
|
||||||
|
"timing",
|
||||||
|
"Enable Process Migration timer",
|
||||||
|
false, false,
|
||||||
|
0, &val);
|
||||||
|
mca_errmgr_crmig_component.timing_enabled = OPAL_INT_TO_BOOL(val);
|
||||||
|
|
||||||
|
mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version,
|
||||||
|
"enable",
|
||||||
|
"Enable Process Migration (Default: 0/off)",
|
||||||
|
false, false,
|
||||||
|
0, &val);
|
||||||
|
mca_errmgr_crmig_component.crmig_enabled = OPAL_INT_TO_BOOL(val);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Debug Output
|
||||||
|
*/
|
||||||
|
opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle,
|
||||||
|
"errmgr:crmig: open()");
|
||||||
|
opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle,
|
||||||
|
"errmgr:crmig: open: priority = %d",
|
||||||
|
mca_errmgr_crmig_component.super.priority);
|
||||||
|
opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle,
|
||||||
|
"errmgr:crmig: open: verbosity = %d",
|
||||||
|
mca_errmgr_crmig_component.super.verbose);
|
||||||
|
opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle,
|
||||||
|
"errmgr:crmig: open: Proc. Mig. = %s",
|
||||||
|
(mca_errmgr_crmig_component.crmig_enabled ? "Enabled" : "Disabled"));
|
||||||
|
opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle,
|
||||||
|
"errmgr:crmig: open: timing = %s",
|
||||||
|
(mca_errmgr_crmig_component.timing_enabled ? "Enabled" : "Disabled"));
|
||||||
|
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
||||||
|
|
||||||
|
static int errmgr_crmig_close(void)
|
||||||
|
{
|
||||||
|
opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle,
|
||||||
|
"errmgr:crmig: close()");
|
||||||
|
|
||||||
|
return ORTE_SUCCESS;
|
||||||
|
}
|
1678
orte/mca/errmgr/crmig/errmgr_crmig_module.c
Normal file
File diff suppressed because it is too large
27
orte/mca/errmgr/crmig/help-orte-errmgr-crmig.txt
Normal file
@ -0,0 +1,27 @@
 -*- text -*-
#
# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation.  All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English general help file for ORTE ErrMgr CRMig framework.
#
[migrating_job]
Notice: A migration of this job has been requested.
        The processes below will be migrated.
        Please standby.
%s
[migrated_job]
Notice: The processes have been successfully migrated to/from the specified
        machines.
[no_migrating_procs]
Warning: Could not find any processes to migrate on the nodes specified.
         You provided the following:
Nodes: %s
Procs: %s
@ -79,6 +79,70 @@ BEGIN_C_DECLS
|
|||||||
/* type definition */
|
/* type definition */
|
||||||
typedef uint8_t orte_errmgr_stack_state_t;
|
typedef uint8_t orte_errmgr_stack_state_t;
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Structure to describe a predicted process fault.
|
||||||
|
*
|
||||||
|
* This can be expanded in the future to support assurance levels, and
|
||||||
|
* additional information that may wish to be conveyed.
|
||||||
|
*/
|
||||||
|
struct orte_errmgr_predicted_proc_t {
|
||||||
|
/** This is an object, so must have a super */
|
||||||
|
opal_list_item_t super;
|
||||||
|
|
||||||
|
/** Process Name */
|
||||||
|
orte_process_name_t proc_name;
|
||||||
|
};
|
||||||
|
typedef struct orte_errmgr_predicted_proc_t orte_errmgr_predicted_proc_t;
|
||||||
|
OBJ_CLASS_DECLARATION(orte_errmgr_predicted_proc_t);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Structure to describe a predicted node fault.
|
||||||
|
*
|
||||||
|
* This can be expanded in the future to support assurance levels, and
|
||||||
|
* additional information that may wish to be conveyed.
|
||||||
|
*/
|
||||||
|
struct orte_errmgr_predicted_node_t {
|
||||||
|
/** This is an object, so must have a super */
|
||||||
|
opal_list_item_t super;
|
||||||
|
|
||||||
|
/** Node Name */
|
||||||
|
char * node_name;
|
||||||
|
};
|
||||||
|
typedef struct orte_errmgr_predicted_node_t orte_errmgr_predicted_node_t;
|
||||||
|
OBJ_CLASS_DECLARATION(orte_errmgr_predicted_node_t);
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Structure to describe a suggested remapping element for a predicted fault.
|
||||||
|
*
|
||||||
|
* This can be expanded in the future to support weights, and
|
||||||
|
* additional information that may wish to be conveyed.
|
||||||
|
*/
|
||||||
|
struct orte_errmgr_predicted_map_t {
|
||||||
|
/** This is an object, so must have a super */
|
||||||
|
opal_list_item_t super;
|
||||||
|
|
||||||
|
/** Process Name (predicted to fail) */
|
||||||
|
orte_process_name_t proc_name;
|
||||||
|
|
||||||
|
/** Node Name (predicted to fail) */
|
||||||
|
char * node_name;
|
||||||
|
|
||||||
|
/** Process Name (Map to) */
|
||||||
|
orte_process_name_t map_proc_name;
|
||||||
|
|
||||||
|
/** Node Name (Map to) */
|
||||||
|
char * map_node_name;
|
||||||
|
|
||||||
|
/** Just off current node */
|
||||||
|
bool off_current_node;
|
||||||
|
|
||||||
|
/** Pre-map fixed node assignment */
|
||||||
|
char * pre_map_fixed_node;
|
||||||
|
};
|
||||||
|
typedef struct orte_errmgr_predicted_map_t orte_errmgr_predicted_map_t;
|
||||||
|
OBJ_CLASS_DECLARATION(orte_errmgr_predicted_map_t);
|
||||||
|
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Macro definitions
|
* Macro definitions
|
||||||
*/
|
*/
|
||||||
@ -129,14 +193,15 @@ typedef int (*orte_errmgr_base_API_update_state_fn_t)(orte_jobid_t job,
|
|||||||
*
|
*
|
||||||
* @param[in] proc_list List of processes (or NULL if none)
|
* @param[in] proc_list List of processes (or NULL if none)
|
||||||
* @param[in] node_list List of nodes (or NULL if none)
|
* @param[in] node_list List of nodes (or NULL if none)
|
||||||
* @param[in] suggested_nodes List of suggested nodes to use on recovery (or NULL if none)
|
* @param[in] suggested_map List of mapping suggestions to use on recovery (or NULL if none)
|
||||||
*
|
*
|
||||||
* @retval ORTE_SUCCESS The operation completed successfully
|
* @retval ORTE_SUCCESS The operation completed successfully
|
||||||
* @retval ORTE_ERROR An unspecified error occurred
|
* @retval ORTE_ERROR An unspecified error occurred
|
||||||
*/
|
*/
|
||||||
typedef int (*orte_errmgr_base_API_predicted_fault_fn_t)(char ***proc_list,
|
typedef int (*orte_errmgr_base_API_predicted_fault_fn_t)(opal_list_t *proc_list,
|
||||||
char ***node_list,
|
opal_list_t *node_list,
|
||||||
char ***suggested_nodes);
|
opal_list_t *suggested_map);
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* Suggest a node to map a restarting process onto
|
* Suggest a node to map a restarting process onto
|
||||||
*
|
*
|
||||||
@ -212,9 +277,9 @@ typedef int (*orte_errmgr_base_module_update_state_fn_t)(orte_jobid_t job,
|
|||||||
pid_t pid,
|
pid_t pid,
|
||||||
orte_exit_code_t exit_code,
|
orte_exit_code_t exit_code,
|
||||||
orte_errmgr_stack_state_t *stack_state);
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
typedef int (*orte_errmgr_base_module_predicted_fault_fn_t)(char ***proc_list,
|
typedef int (*orte_errmgr_base_module_predicted_fault_fn_t)(opal_list_t *proc_list,
|
||||||
char ***node_list,
|
opal_list_t *node_list,
|
||||||
char ***suggested_nodes,
|
opal_list_t *suggested_map,
|
||||||
orte_errmgr_stack_state_t *stack_state);
|
orte_errmgr_stack_state_t *stack_state);
|
||||||
typedef int (*orte_errmgr_base_module_suggest_map_targets_fn_t)(orte_proc_t *proc,
|
typedef int (*orte_errmgr_base_module_suggest_map_targets_fn_t)(orte_proc_t *proc,
|
||||||
orte_node_t *oldnode,
|
orte_node_t *oldnode,
|
||||||
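
The `predicted_fault()` hunks above replace `char ***` arrays with lists of typed items (`orte_errmgr_predicted_proc_t`, `orte_errmgr_predicted_node_t`, `orte_errmgr_predicted_map_t`), so each entry can later carry more than a bare name. The sketch below illustrates that idea only. It is not Open MPI code: `predicted_node_t` and its helpers are hypothetical stand-ins, and a plain singly linked list is used in place of `opal_list_t` so the example compiles without the Open MPI headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for orte_errmgr_predicted_node_t. New fields (for
 * example an assurance level) could be added here without changing the
 * signature of any function that takes the list. */
typedef struct predicted_node {
    struct predicted_node *next;
    char *node_name;
} predicted_node_t;

/* Append a copy of name to the end of the list.
 * Returns 0 on success, -1 if memory allocation fails. */
int predicted_node_append(predicted_node_t **head, const char *name)
{
    predicted_node_t **tail = head;
    predicted_node_t *item;
    size_t len;

    assert(NULL != head && NULL != name);

    len = strlen(name) + 1;
    item = malloc(sizeof(*item));
    if (NULL == item) {
        return -1;
    }
    item->node_name = malloc(len);
    if (NULL == item->node_name) {
        free(item);
        return -1;
    }
    memcpy(item->node_name, name, len);
    item->next = NULL;

    /* Walk to the end so entries keep their insertion order */
    while (NULL != *tail) {
        tail = &(*tail)->next;
    }
    *tail = item;
    return 0;
}

/* Count the entries in the list. */
size_t predicted_node_count(const predicted_node_t *head)
{
    size_t n = 0;
    for (; NULL != head; head = head->next) {
        ++n;
    }
    return n;
}

/* Release every entry and its name, leaving *head set to NULL. */
void predicted_node_free_all(predicted_node_t **head)
{
    assert(NULL != head);
    while (NULL != *head) {
        predicted_node_t *next = (*head)->next;
        free((*head)->node_name);
        free(*head);
        *head = next;
    }
}
```

In the real interface the caller builds the items with `OBJ_NEW` and `opal_list_append`, as `errmgr_base_tool_cmdline_process_recv()` does earlier in this diff.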
|