A number of C/R enhancements per RFC below:
  http://www.open-mpi.org/community/lists/devel/2010/07/8240.php

Documentation:
  http://osl.iu.edu/research/ft/

Major Changes:
--------------
 * Added C/R-enabled Debugging support.
   Enabled with the --enable-crdebug flag. See the following website for more information:
   http://osl.iu.edu/research/ft/crdebug/
 * Added Stable Storage (SStore) framework for checkpoint storage
   * 'central' component does a direct to central storage save
   * 'stage' component stages checkpoints to central storage while the application continues execution.
     * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
     * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
 * Added Compression (compress) framework to support
 * Add two new ErrMgr recovery policies
   * {{{crmig}}} C/R Process Migration
   * {{{autor}}} C/R Automatic Recovery
 * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
 * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
   * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
   * {{{OMPI_CR_Restart}}}
   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
   * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
   * {{{OMPI_CR_Quiesce_start}}}
   * {{{OMPI_CR_Quiesce_checkpoint}}}
   * {{{OMPI_CR_Quiesce_end}}}
   * {{{OMPI_CR_self_register_checkpoint_callback}}}
   * {{{OMPI_CR_self_register_restart_callback}}}
   * {{{OMPI_CR_self_register_continue_callback}}}
 * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
 * Add a progress meter to:
   * FileM rsh (filem_rsh_process_meter)
   * SnapC full (snapc_full_progress_meter)
   * SStore stage (sstore_stage_progress_meter)
 * Added 2 new command line options to ompi-restart
   * --showme : Display the full command line that would have been exec'ed.
   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
 * Deprecated some MCA params:
   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
   * snapc_base_store_in_place deprecated, replaced with different components of SStore
   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
   * snapc_base_establish_global_snapshot_dir deprecated, never well supported
   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem

Minor Changes:
--------------
 * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
 * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
 * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
 * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
 * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
 * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
 * Cleanup the CRS framework and components to work with the SStore framework.
 * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
 * Add 'quiesce' hook to CRCP for a future enhancement.
 * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
 * Add optional application level INC callbacks (registered through the CR MPI Ext interface).
 * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
 * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
 * {{{opal-restart}}} also supports local decompression before restarting
 * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
 * {{{orte-restart}}} now uses the SStore framework to work with the metadata
 * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
 * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
 * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
 * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
 * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
 * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
 * Improve the checks for 'already checkpointing' error path.
 * Add a recovery output timer, to show how long it takes to restart a job
 * Do a better job of cleaning up the old session directory on restart.
 * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (which does not work for MPI apps at the moment)
 * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.

This commit was SVN r23587.

The following Trac tickets were found above:
  Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
  Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
  Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
  Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
  Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
  Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
  Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
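The deprecated MCA parameters listed in the commit message can be read as a rename table. The following is a minimal sketch of such a table in C, intended only as a migration aid; `struct mca_param_rename` and `replacement_for` are hypothetical names and are not part of Open MPI. Entries for which the message gives no one-to-one successor carry a NULL replacement.

```c
#include <stddef.h>
#include <string.h>

/* Old MCA parameter name -> replacement, as listed in the commit message.
 * A NULL replacement means the message names no one-to-one successor. */
struct mca_param_rename {
    const char *old_name;
    const char *new_name;
};

static const struct mca_param_rename deprecated_params[] = {
    { "crs_base_snapshot_dir",                    "sstore_stage_local_snapshot_dir" },
    { "snapc_base_global_snapshot_dir",           "sstore_base_global_snapshot_dir" },
    { "snapc_base_global_shared",                 "sstore_stage_global_is_shared"   },
    { "snapc_base_store_in_place",                NULL },
    { "snapc_base_global_snapshot_ref",           "sstore_base_global_snapshot_ref" },
    { "snapc_base_establish_global_snapshot_dir", NULL },
    { "snapc_full_skip_filem",                    "sstore_stage_skip_filem"         },
};

/* Hypothetical helper (not an Open MPI function).
 * Returns the replacement name for a deprecated parameter, NULL if the
 * parameter is deprecated without a direct successor, or the input pointer
 * itself if the name is not in the table. */
static const char *replacement_for(const char *name)
{
    size_t n = sizeof(deprecated_params) / sizeof(deprecated_params[0]);
    for (size_t i = 0; i < n; i++) {
        if (0 == strcmp(name, deprecated_params[i].old_name)) {
            return deprecated_params[i].new_name;
        }
    }
    return name;
}
```

A script that rewrites old parameter files could call `replacement_for` on each key and warn whenever it returns NULL, since those settings have to be re-expressed by choosing an SStore component by hand.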
Parent: 9fff01704f
Commit: e12ca48cd9
@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
||||
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||
@ -22,7 +22,9 @@ amca_paramdir = $(AMCA_PARAM_SETS_DIR)
|
||||
dist_amca_param_DATA = amca-param-sets/example.conf
|
||||
|
||||
if WANT_FT
|
||||
dist_amca_param_DATA += amca-param-sets/ft-enable-cr
|
||||
dist_amca_param_DATA += \
|
||||
amca-param-sets/ft-enable-cr \
|
||||
amca-param-sets/ft-enable-cr-recovery
|
||||
endif
|
||||
|
||||
EXTRA_DIST = \
|
||||
|
@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2008-2009 The Trustees of Indiana University and Indiana
|
||||
# Copyright (c) 2008-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
#
|
||||
@ -37,7 +37,6 @@ opal_cr_use_thread=1
|
||||
#
|
||||
rml_wrapper=ftrm
|
||||
snapc=full
|
||||
#filem=rsh
|
||||
|
||||
#
|
||||
# OMPI Parameters
|
||||
|
82
contrib/amca-param-sets/ft-enable-cr-recovery
Normal file
@ -0,0 +1,82 @@
|
||||
#
|
||||
# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
#
|
||||
# $HEADER$
|
||||
#
|
||||
# An Aggregate MCA Parameter Set to enable checkpoint/restart capabilities
|
||||
# for a job.
|
||||
#
|
||||
# Usage:
|
||||
# shell$ mpirun -am ft-enable-cr-recovery ./app
|
||||
#
|
||||
|
||||
#
|
||||
# OPAL Parameters
|
||||
# - Turn off OPAL only checkpointing
|
||||
# - Select only checkpoint ready components
|
||||
# - Enable Additional FT infrastructure
|
||||
# - Auto-select OPAL CRS component
|
||||
# - If available, use the FT Thread (Default)
|
||||
#
|
||||
opal_cr_allow_opal_only=0
|
||||
mca_base_component_distill_checkpoint_ready=1
|
||||
ft_cr_enabled=1
|
||||
crs=
|
||||
opal_cr_use_thread=1
|
||||
|
||||
#
|
||||
# ORTE Parameters
|
||||
# - Wrap the RML
|
||||
# - Use the 'full' Snapshot Coordinator
|
||||
# - Use the 'cm' routed component. It is the only one that is currently able to
|
||||
# handle process and daemon loss.
|
||||
#
|
||||
rml_wrapper=ftrm
|
||||
snapc=full
|
||||
routed=cm
|
||||
|
||||
#
|
||||
# OMPI Parameters
|
||||
# - Wrap the PML
|
||||
# - Use a Bookmark Exchange Fully Coordinated Checkpoint/Restart Coordination Protocol
|
||||
#
|
||||
pml_wrapper=crcpw
|
||||
crcp=bkmrk
|
||||
|
||||
#
|
||||
# Temporary fix to force the event engine to use poll to behave well with BLCR
|
||||
#
|
||||
opal_event_include=poll
|
||||
|
||||
#
|
||||
# We currently only support the following options to the OpenIB BTL
|
||||
# Future development will attempt to eliminate many of these restrictions
|
||||
#
|
||||
btl_openib_want_fork_support=1
|
||||
btl_openib_use_async_event_thread=0
|
||||
btl_openib_use_eager_rdma=0
|
||||
btl_openib_cpc_include=oob
|
||||
|
||||
# Enable SIGTSTP/SIGCONT capability
|
||||
# killall -TSTP mpirun
|
||||
# killall -CONT mpirun
|
||||
orte_forward_job_control=1
|
||||
|
||||
#
|
||||
# Use the C/R Error Management and Recovery Service
|
||||
#
|
||||
orte_enable_recovery=1
|
||||
orte_max_global_restarts=10
|
||||
errmgr_crmig_enable=1
|
||||
errmgr_autor_enable=1
|
||||
|
||||
#
|
||||
# Additional constraints to be lifted in the future
|
||||
#
|
||||
plm=rsh
|
||||
rmaps=resilient
|
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
* University Research and Technology
|
||||
* Corporation. All rights reserved.
|
||||
* Copyright (c) 2004-2007 The University of Tennessee and The University
|
||||
@ -54,7 +54,7 @@ int mca_bml_r2_ft_event(int state)
|
||||
first_continue_pass = !first_continue_pass;
|
||||
|
||||
/* Since nothing in Checkpoint, we are fine here (unless required by BTL) */
|
||||
if( ompi_cr_continue_like_restart && !first_continue_pass) {
|
||||
if( orte_cr_continue_like_restart && !first_continue_pass) {
|
||||
procs = ompi_proc_all(&num_procs);
|
||||
if(NULL == procs) {
|
||||
return OMPI_ERR_OUT_OF_RESOURCE;
|
||||
@ -136,7 +136,7 @@ int mca_bml_r2_ft_event(int state)
|
||||
}
|
||||
else if(OPAL_CRS_CONTINUE == state) {
|
||||
/* Matches OPAL_CRS_RESTART_PRE */
|
||||
if( ompi_cr_continue_like_restart && first_continue_pass) {
|
||||
if( orte_cr_continue_like_restart && first_continue_pass) {
|
||||
if( OMPI_SUCCESS != (ret = mca_bml_r2_finalize()) ) {
|
||||
opal_output(0, "bml:r2: ft_event(Restart): Failed to finalize BML framework\n");
|
||||
return ret;
|
||||
@ -147,7 +147,7 @@ int mca_bml_r2_ft_event(int state)
|
||||
}
|
||||
}
|
||||
/* Matches OPAL_CRS_RESTART */
|
||||
else if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
||||
else if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||
/*
|
||||
* Barrier to make all processes have been successfully restarted before
|
||||
* we try to remove some restart only files.
|
||||
@ -157,10 +157,6 @@ int mca_bml_r2_ft_event(int state)
|
||||
return ret;
|
||||
}
|
||||
|
||||
opal_output_verbose(10, ompi_cr_output,
|
||||
"bml:r2: ft_event(Restart): Cleanup restart files\n");
|
||||
opal_crs_base_cleanup_flush();
|
||||
|
||||
/*
|
||||
* Re-open the BTL framework to get the full list of components.
|
||||
*/
|
||||
@ -234,10 +230,6 @@ int mca_bml_r2_ft_event(int state)
|
||||
return ret;
|
||||
}
|
||||
|
||||
opal_output_verbose(10, ompi_cr_output,
|
||||
"bml:r2: ft_event(Restart): Cleanup restart files\n");
|
||||
opal_crs_base_cleanup_flush();
|
||||
|
||||
/*
|
||||
* Re-open the BTL framework to get the full list of components.
|
||||
* - but first clear the MCA value that was there
|
||||
|
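The hunks above rely on a toggle, `first_continue_pass`, so that a single CONTINUE state can stand in for both restart phases when `orte_cr_continue_like_restart` is set. The following is a minimal, self-contained sketch of that two-pass idea. It assumes the toggle starts as false and flips once per CONTINUE event, neither of which is visible in the diff; the enum and function names are hypothetical.

```c
#include <stdbool.h>

/* What a CONTINUE event is treated as (names are illustrative only). */
enum continue_phase {
    PHASE_PLAIN_CONTINUE,   /* nothing needs to be rebuilt */
    PHASE_LIKE_RESTART_PRE, /* first pass: tear the framework down */
    PHASE_LIKE_RESTART      /* second pass: reopen and reconnect */
};

/* Stand-ins for orte_cr_continue_like_restart and the static toggle.
 * The initial value of the toggle (false) is an assumption. */
static bool continue_like_restart = false;
static bool first_continue_pass = false;

/* Called once per CONTINUE event; reports which phase this call plays. */
static enum continue_phase on_continue_event(void)
{
    first_continue_pass = !first_continue_pass;

    if (!continue_like_restart) {
        return PHASE_PLAIN_CONTINUE;
    }
    return first_continue_pass ? PHASE_LIKE_RESTART_PRE : PHASE_LIKE_RESTART;
}
```

With the flag set, consecutive calls alternate between the teardown pass and the rebuild pass, which mirrors the "Matches OPAL_CRS_RESTART_PRE" and "Matches OPAL_CRS_RESTART" branches in the diff.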
@ -641,7 +641,7 @@ int mca_btl_mx_ft_event(int state) {
|
||||
* kernel: blcr: thaw_threads returned error, aborting. -1
|
||||
* JJH: It may be possible to, instead of restarting the entire driver, just reconnect endpoints
|
||||
*/
|
||||
ompi_cr_continue_like_restart = true;
|
||||
orte_cr_continue_like_restart = true;
|
||||
|
||||
for( i = 0; i < mca_btl_mx_component.mx_num_btls; i++ ) {
|
||||
mx_btl = mca_btl_mx_component.mx_btls[i];
|
||||
|
@ -1735,7 +1735,7 @@ int mca_btl_openib_ft_event(int state) {
|
||||
if(OPAL_CRS_CHECKPOINT == state) {
|
||||
/* Continue must reconstruct the routes (including modex), since we
|
||||
* have to tear down the devices completely. */
|
||||
ompi_cr_continue_like_restart = true;
|
||||
orte_cr_continue_like_restart = true;
|
||||
|
||||
/*
|
||||
* To keep the node from crashing we need to call ibv_close_device
|
||||
|
@ -52,6 +52,7 @@
|
||||
#if OPAL_ENABLE_FT_CR == 1
|
||||
#include "opal/mca/crs/base/base.h"
|
||||
#include "opal/util/basename.h"
|
||||
#include "orte/mca/sstore/sstore.h"
|
||||
#include "ompi/runtime/ompi_cr.h"
|
||||
#endif
|
||||
|
||||
@ -1099,8 +1100,6 @@ int mca_btl_sm_ft_event(int state) {
|
||||
}
|
||||
#else
|
||||
int mca_btl_sm_ft_event(int state) {
|
||||
char * tmp_dir = NULL;
|
||||
|
||||
/* Notify mpool */
|
||||
if( NULL != mca_btl_sm_component.sm_mpool &&
|
||||
NULL != mca_btl_sm_component.sm_mpool->mpool_ft_event) {
|
||||
@ -1114,17 +1113,14 @@ int mca_btl_sm_ft_event(int state) {
|
||||
* for these old file handles. The restart procedure will make sure
|
||||
* these files get cleaned up appropriately.
|
||||
*/
|
||||
opal_crs_base_metadata_write_token(NULL, CRS_METADATA_TOUCH, mca_btl_sm_component.sm_seg->module_seg_path);
|
||||
|
||||
/* Record the job session directory */
|
||||
opal_crs_base_metadata_write_token(NULL, CRS_METADATA_MKDIR, orte_process_info.job_session_dir);
|
||||
orte_sstore.set_attr(orte_sstore_handle_current,
|
||||
SSTORE_METADATA_LOCAL_TOUCH,
|
||||
mca_btl_sm_component.sm_seg->module_seg_path);
|
||||
}
|
||||
}
|
||||
else if(OPAL_CRS_CONTINUE == state) {
|
||||
if( ompi_cr_continue_like_restart ) {
|
||||
if( orte_cr_continue_like_restart ) {
|
||||
if( NULL != mca_btl_sm_component.sm_seg ) {
|
||||
/* Do not Add session directory on continue */
|
||||
|
||||
/* Add shared memory file */
|
||||
opal_crs_base_cleanup_append(mca_btl_sm_component.sm_seg->module_seg_path, false);
|
||||
}
|
||||
@ -1136,14 +1132,6 @@ int mca_btl_sm_ft_event(int state) {
|
||||
else if(OPAL_CRS_RESTART == state ||
|
||||
OPAL_CRS_RESTART_PRE == state) {
|
||||
if( NULL != mca_btl_sm_component.sm_seg ) {
|
||||
/* Add session directory */
|
||||
opal_crs_base_cleanup_append(orte_process_info.job_session_dir, true);
|
||||
tmp_dir = opal_dirname(orte_process_info.job_session_dir);
|
||||
if( NULL != tmp_dir ) {
|
||||
opal_crs_base_cleanup_append(tmp_dir, true);
|
||||
free(tmp_dir);
|
||||
tmp_dir = NULL;
|
||||
}
|
||||
/* Add shared memory file */
|
||||
opal_crs_base_cleanup_append(mca_btl_sm_component.sm_seg->module_seg_path, false);
|
||||
}
|
||||
|
@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
||||
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||
@ -26,3 +26,4 @@ libmca_crcp_la_SOURCES += \
|
||||
base/crcp_base_close.c \
|
||||
base/crcp_base_select.c \
|
||||
base/crcp_base_fns.c
|
||||
|
||||
|
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
* University Research and Technology
|
||||
* Corporation. All rights reserved.
|
||||
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||
@ -60,6 +60,12 @@ BEGIN_C_DECLS
|
||||
*/
|
||||
OMPI_DECLSPEC int ompi_crcp_base_close(void);
|
||||
|
||||
/**
|
||||
* Quiesce Interface (For MPI Ext.)
|
||||
*/
|
||||
OMPI_DECLSPEC int ompi_crcp_base_quiesce_start(MPI_Info *info);
|
||||
OMPI_DECLSPEC int ompi_crcp_base_quiesce_end(MPI_Info *info);
|
||||
|
||||
/**
|
||||
* 'None' component functions
|
||||
* These are to be used when no component is selected.
|
||||
@ -72,6 +78,10 @@ BEGIN_C_DECLS
|
||||
int ompi_crcp_base_module_init(void);
|
||||
int ompi_crcp_base_module_finalize(void);
|
||||
|
||||
/* Quiesce Interface */
|
||||
int ompi_crcp_base_none_quiesce_start(MPI_Info *info);
|
||||
int ompi_crcp_base_none_quiesce_end(MPI_Info *info);
|
||||
|
||||
/* PML Interface */
|
||||
ompi_crcp_base_pml_state_t* ompi_crcp_base_none_pml_enable( bool enable, ompi_crcp_base_pml_state_t* );
|
||||
|
||||
|
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2008 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||
* All rights reserved.
|
||||
@ -38,6 +38,7 @@
|
||||
#include "ompi/mca/crcp/crcp.h"
|
||||
#include "ompi/mca/crcp/base/base.h"
|
||||
#include "ompi/mca/bml/base/base.h"
|
||||
#include "ompi/info/info.h"
|
||||
#include "ompi/mca/pml/pml.h"
|
||||
#include "ompi/mca/pml/base/base.h"
|
||||
#include "ompi/mca/pml/base/pml_base_request.h"
|
||||
@ -92,6 +93,19 @@ int ompi_crcp_base_module_finalize(void)
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
|
||||
/****************
|
||||
* MPI Quiesce Interface
|
||||
****************/
|
||||
int ompi_crcp_base_none_quiesce_start(MPI_Info *info)
|
||||
{
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
|
||||
int ompi_crcp_base_none_quiesce_end(MPI_Info *info)
|
||||
{
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
|
||||
/****************
|
||||
* PML Wrapper
|
||||
****************/
|
||||
@ -397,3 +411,24 @@ ompi_crcp_base_none_btl_ft_event(int state,
|
||||
/********************
|
||||
* Utility functions
|
||||
********************/
|
||||
|
||||
/******************
|
||||
* MPI Interface Functions
|
||||
******************/
|
||||
int ompi_crcp_base_quiesce_start(MPI_Info *info)
|
||||
{
|
||||
if( NULL != ompi_crcp.quiesce_start ) {
|
||||
return ompi_crcp.quiesce_start(info);
|
||||
} else {
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
}
|
||||
|
||||
int ompi_crcp_base_quiesce_end(MPI_Info *info)
|
||||
{
|
||||
if( NULL != ompi_crcp.quiesce_end ) {
|
||||
return ompi_crcp.quiesce_end(info);
|
||||
} else {
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
}
|
||||
|
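The base-level `ompi_crcp_base_quiesce_start` and `ompi_crcp_base_quiesce_end` wrappers above treat a missing component hook as success and otherwise forward to it. A stripped-down sketch of that optional-hook dispatch follows; the struct, the `void *` argument (standing in for `MPI_Info *`), and the `SKETCH_*` return codes are assumptions made so the example compiles without Open MPI headers.

```c
#include <stddef.h>

#define SKETCH_SUCCESS              0
#define SKETCH_ERR_NOT_IMPLEMENTED (-1)

/* Hook type; void* stands in for MPI_Info* in this sketch. */
typedef int (*quiesce_fn_t)(void *info);

/* Simplified stand-in for the selected CRCP module. */
struct crcp_module_sketch {
    quiesce_fn_t quiesce_start;
    quiesce_fn_t quiesce_end;
};

static struct crcp_module_sketch selected_module = { NULL, NULL };

/* Forward to the component hook if one is registered, else report success. */
static int base_quiesce_start(void *info)
{
    if (NULL != selected_module.quiesce_start) {
        return selected_module.quiesce_start(info);
    }
    return SKETCH_SUCCESS;
}

/* A hook that behaves like a component that has not implemented quiesce. */
static int not_implemented_quiesce(void *info)
{
    (void)info;
    return SKETCH_ERR_NOT_IMPLEMENTED;
}
```

The consequence for callers is that a NULL hook and a hook that reports "not implemented" are distinguishable: the first silently succeeds, the second surfaces an error code.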
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2008 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||
* All rights reserved.
|
||||
@ -63,6 +63,10 @@ static ompi_crcp_base_module_t none_module = {
|
||||
/** Finalization Function */
|
||||
ompi_crcp_base_module_finalize,
|
||||
|
||||
/** Quiesce interface */
|
||||
ompi_crcp_base_none_quiesce_start,
|
||||
ompi_crcp_base_none_quiesce_end,
|
||||
|
||||
/** PML Wrapper */
|
||||
ompi_crcp_base_none_pml_enable,
|
||||
|
||||
|
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2008 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||
* All rights reserved.
|
||||
@ -57,6 +57,12 @@ BEGIN_C_DECLS
|
||||
int ompi_crcp_bkmrk_pml_init(void);
|
||||
int ompi_crcp_bkmrk_pml_finalize(void);
|
||||
|
||||
/*
|
||||
* Quiesce Interface
|
||||
*/
|
||||
int ompi_crcp_bkmrk_quiesce_start(MPI_Info *info);
|
||||
int ompi_crcp_bkmrk_quiesce_end(MPI_Info *info);
|
||||
|
||||
END_C_DECLS
|
||||
|
||||
#endif /* MCA_CRCP_HOKE_EXPORT_H */
|
||||
|
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2009 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||
* All rights reserved.
|
||||
@ -44,6 +44,10 @@ static ompi_crcp_base_module_t loc_module = {
|
||||
/** Finalization Function */
|
||||
ompi_crcp_bkmrk_module_finalize,
|
||||
|
||||
/** Quiesce interface */
|
||||
ompi_crcp_bkmrk_quiesce_start,
|
||||
ompi_crcp_bkmrk_quiesce_end,
|
||||
|
||||
/** PML Wrapper */
|
||||
NULL, /* ompi_crcp_bkmrk_pml_enable, */
|
||||
|
||||
@ -131,6 +135,34 @@ int ompi_crcp_bkmrk_module_finalize(void)
|
||||
return OMPI_SUCCESS;
|
||||
}
|
||||
|
||||
int ompi_crcp_bkmrk_quiesce_start(MPI_Info *info)
|
||||
{
|
||||
OPAL_OUTPUT_VERBOSE((10, mca_crcp_bkmrk_component.super.output_handle,
|
||||
"crcp:bkmrk: quiesce_start(--)"));
|
||||
#if 0
|
||||
if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_start(QUIESCE_TAG_CKPT)) ) {
|
||||
;
|
||||
}
|
||||
return OMPI_SUCCESS;
|
||||
#else
|
||||
return OMPI_ERR_NOT_IMPLEMENTED;
|
||||
#endif
|
||||
}
|
||||
|
||||
int ompi_crcp_bkmrk_quiesce_end(MPI_Info *info)
|
||||
{
|
||||
OPAL_OUTPUT_VERBOSE((10, mca_crcp_bkmrk_component.super.output_handle,
|
||||
"crcp:bkmrk: quiesce_end(--)"));
|
||||
#if 0
|
||||
if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_end(QUIESCE_TAG_CONTINUE) ) ) {
|
||||
;
|
||||
}
|
||||
return OMPI_SUCCESS;
|
||||
#else
|
||||
return OMPI_ERR_NOT_IMPLEMENTED;
|
||||
#endif
|
||||
}
|
||||
|
||||
/******************
|
||||
* Local functions
|
||||
******************/
|
||||
|
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2009 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2010 The University of Tennessee and The University
|
||||
* of Tennessee Research Foundation. All rights
|
||||
@ -2986,6 +2986,26 @@ int ompi_crcp_bkmrk_request_complete(struct ompi_request_t *request)
|
||||
}
|
||||
|
||||
/**************** FT Event *****************/
|
||||
int ompi_crcp_bkmrk_pml_quiesce_start(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag ) {
|
||||
int ret, exit_status = OMPI_SUCCESS;
|
||||
|
||||
if( OMPI_SUCCESS != (ret = ft_event_coordinate_peers()) ) {
|
||||
exit_status = ret;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
int ompi_crcp_bkmrk_pml_quiesce_end(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag ) {
|
||||
int ret, exit_status = OMPI_SUCCESS;
|
||||
|
||||
if( OMPI_SUCCESS != (ret = ft_event_finalize_exchange() ) ) {
|
||||
exit_status = ret;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event(
|
||||
int state,
|
||||
ompi_crcp_base_pml_state_t* pml_state)
|
||||
@ -3027,7 +3047,7 @@ ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event(
|
||||
* When we return from this function we know that all of our
|
||||
* channels have been flushed.
|
||||
*/
|
||||
if( OMPI_SUCCESS != (ret = ft_event_coordinate_peers()) ) {
|
||||
if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_start(QUIESCE_TAG_CKPT)) ) {
|
||||
opal_output(mca_crcp_bkmrk_component.super.output_handle,
|
||||
"crcp:bkmrk: %s ft_event: Checkpoint Coordination Failed %d",
|
||||
ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
|
||||
@ -3060,7 +3080,7 @@ ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event(
|
||||
first_continue_pass = !first_continue_pass;
|
||||
|
||||
/* Only finalize the Protocol after the PML has been rebuilt */
|
||||
if( ompi_cr_continue_like_restart && first_continue_pass ) {
|
||||
if( orte_cr_continue_like_restart && first_continue_pass ) {
|
||||
goto DONE;
|
||||
}
|
||||
|
||||
@ -3069,7 +3089,7 @@ ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event(
|
||||
/*
|
||||
* Finish the coord protocol
|
||||
*/
|
||||
if( OMPI_SUCCESS != (ret = ft_event_finalize_exchange() ) ) {
|
||||
if( OMPI_SUCCESS != (ret = ompi_crcp_bkmrk_pml_quiesce_end(QUIESCE_TAG_CONTINUE) ) ) {
|
||||
opal_output(mca_crcp_bkmrk_component.super.output_handle,
|
||||
"crcp:bkmrk: pml_ft_event: Checkpoint Finalization Failed %d",
|
||||
ret);
|
||||
|
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2007 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||
* All rights reserved.
|
||||
@ -116,6 +116,18 @@ BEGIN_C_DECLS
|
||||
ompi_crcp_base_pml_state_t* ompi_crcp_bkmrk_pml_ft_event
|
||||
(int state, ompi_crcp_base_pml_state_t* pml_state);
|
||||
|
||||
enum ompi_crcp_bkmrk_pml_quiesce_tag_type_t {
|
||||
QUIESCE_TAG_NONE = 0, /* 0 No tag specified */
|
||||
QUIESCE_TAG_CKPT, /* 1 Prepare for checkpoint */
|
||||
QUIESCE_TAG_CONTINUE, /* 2 Continue after a checkpoint */
|
||||
QUIESCE_TAG_RESTART, /* 3 Restart from a checkpoint */
|
||||
QUIESCE_TAG_UNKNOWN /* 4 Unknown */
|
||||
};
|
||||
typedef enum ompi_crcp_bkmrk_pml_quiesce_tag_type_t ompi_crcp_bkmrk_pml_quiesce_tag_type_t;
|
||||
|
||||
int ompi_crcp_bkmrk_pml_quiesce_start(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag );
|
||||
int ompi_crcp_bkmrk_pml_quiesce_end(ompi_crcp_bkmrk_pml_quiesce_tag_type_t tag );
|
||||
|
||||
/*
|
||||
* Request function
|
||||
*/
|
||||
|
@ -61,6 +61,23 @@ typedef int (*ompi_crcp_base_module_init_fn_t)
|
||||
typedef int (*ompi_crcp_base_module_finalize_fn_t)
|
||||
(void);
|
||||
|
||||
|
||||
/************************
|
||||
* MPI Quiesce Interface
|
||||
************************/
|
||||
/**
|
||||
* MPI_Quiesce_start component interface
|
||||
*/
|
||||
typedef int (*ompi_crcp_base_quiesce_start_fn_t)
|
||||
(MPI_Info *info);
|
||||
|
||||
/**
|
||||
* MPI_Quiesce_end component interface
|
||||
*/
|
||||
typedef int (*ompi_crcp_base_quiesce_end_fn_t)
|
||||
(MPI_Info *info);
|
||||
|
||||
|
||||
/************************
|
||||
* PML Wrapper hooks
|
||||
* PML Wrapper is the CRCPW PML component
|
||||
@ -283,6 +300,10 @@ struct ompi_crcp_base_module_1_0_0_t {
|
||||
/** Finalization Function */
|
||||
ompi_crcp_base_module_finalize_fn_t crcp_finalize;
|
||||
|
||||
/**< MPI_Quiesce Interface Functions ******************/
|
||||
ompi_crcp_base_quiesce_start_fn_t quiesce_start;
|
||||
ompi_crcp_base_quiesce_end_fn_t quiesce_end;
|
||||
|
||||
/**< PML Wrapper Functions ****************************/
|
||||
ompi_crcp_base_pml_enable_fn_t pml_enable;
|
||||
|
||||
|
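Because the two quiesce slots are inserted in the middle of the module struct, every positionally initialized module (such as `none_module` and `loc_module` earlier in this commit) has to gain entries at exactly the same position. The sketch below contrasts positional initialization with C99 designated initializers, which do not depend on member order. It is an illustration only; the struct and stub functions are hypothetical, and the Open MPI sources shown here use the positional form.

```c
#include <stddef.h>

typedef int (*hook_fn_t)(void);

/* Simplified module layout: the two quiesce slots sit between
 * finalize and pml_enable, as in the diff. */
struct module_sketch {
    hook_fn_t init;
    hook_fn_t finalize;
    hook_fn_t quiesce_start;
    hook_fn_t quiesce_end;
    hook_fn_t pml_enable;
};

static int hook_ok(void)      { return 0; }
static int quiesce_stub(void) { return 1; }
static int pml_stub(void)     { return 2; }

/* Positional: values must follow the declaration order exactly, so adding
 * a member in the middle means touching every initializer. */
static const struct module_sketch positional = {
    hook_ok, hook_ok, quiesce_stub, quiesce_stub, pml_stub
};

/* Designated (C99): order-independent; members that are not named
 * are zero-initialized, i.e. NULL for function pointers. */
static const struct module_sketch designated = {
    .pml_enable = pml_stub,
    .init       = hook_ok,
    .finalize   = hook_ok,
};
```

With the designated form, a module that does not mention the new slots ends up with NULL hooks there, which pairs naturally with a base-level wrapper that checks for NULL before calling.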
@ -32,6 +32,7 @@
|
||||
#include "orte/util/proc_info.h"
|
||||
|
||||
#if OPAL_ENABLE_FT_CR == 1
|
||||
#include "orte/mca/sstore/sstore.h"
|
||||
#include "ompi/mca/mpool/base/base.h"
|
||||
#include "ompi/runtime/ompi_cr.h"
|
||||
#endif
|
||||
@ -169,12 +170,12 @@ int mca_mpool_sm_ft_event(int state) {
|
||||
asprintf( &file_name, "%s"OPAL_PATH_SEP"shared_mem_pool.%s",
|
||||
orte_process_info.job_session_dir,
|
||||
orte_process_info.nodename );
|
||||
opal_crs_base_metadata_write_token(NULL, CRS_METADATA_TOUCH, file_name);
|
||||
orte_sstore.set_attr(orte_sstore_handle_current, SSTORE_METADATA_LOCAL_TOUCH, file_name);
|
||||
free(file_name);
|
||||
file_name = NULL;
|
||||
}
|
||||
else if(OPAL_CRS_CONTINUE == state) {
|
||||
if(ompi_cr_continue_like_restart) {
|
||||
if(orte_cr_continue_like_restart) {
|
||||
/* Find the sm module */
|
||||
self_module = mca_mpool_base_module_lookup("sm");
|
||||
self_sm_module = (mca_mpool_sm_module_t*) self_module;
|
||||
|
@ -691,7 +691,7 @@ int mca_pml_bfo_ft_event( int state )
|
||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2);
|
||||
}
|
||||
|
||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
||||
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||
/*
|
||||
* Get a list of processes
|
||||
*/
|
||||
@ -791,7 +791,7 @@ int mca_pml_bfo_ft_event( int state )
|
||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3);
|
||||
}
|
||||
|
||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
||||
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||
/*
|
||||
* Exchange the modex information once again.
|
||||
* BTLs will have republished their modex information.
|
||||
|
@ -669,7 +669,7 @@ int mca_pml_csum_ft_event( int state )
|
||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2);
|
||||
}
|
||||
|
||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
||||
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||
/*
|
||||
* Get a list of processes
|
||||
*/
|
||||
@ -769,7 +769,7 @@ int mca_pml_csum_ft_event( int state )
|
||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3);
|
||||
}
|
||||
|
||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
||||
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||
/*
|
||||
* Exchange the modex information once again.
|
||||
* BTLs will have republished their modex information.
|
||||
|
@ -638,7 +638,7 @@ int mca_pml_ob1_ft_event( int state )
|
||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P2);
|
||||
}
|
||||
|
||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
||||
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||
/*
|
||||
* Get a list of processes
|
||||
*/
|
||||
@ -738,7 +738,7 @@ int mca_pml_ob1_ft_event( int state )
|
||||
OPAL_CR_SET_TIMER(OPAL_CR_TIMER_P2P3);
|
||||
}
|
||||
|
||||
if( ompi_cr_continue_like_restart && !first_continue_pass ) {
|
||||
if( orte_cr_continue_like_restart && !first_continue_pass ) {
|
||||
/*
|
||||
* Exchange the modex information once again.
|
||||
* BTLs will have republished their modex information.
|
||||
|
38
ompi/mpiext/cr/Makefile.am
Normal file
@ -0,0 +1,38 @@
|
||||
#
|
||||
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
#
|
||||
# $HEADER$
|
||||
#
|
||||
|
||||
headers = \
|
||||
mpiext_cr_c.h
|
||||
|
||||
sources = \
|
||||
c/checkpoint.c \
|
||||
c/restart.c \
|
||||
c/migrate.c \
|
||||
c/inc_register_callback.c \
|
||||
c/quiesce_start.c \
|
||||
c/quiesce_end.c \
|
||||
c/quiesce_checkpoint.c \
|
||||
c/self_register_checkpoint.c \
|
||||
c/self_register_restart.c \
|
||||
c/self_register_continue.c
|
||||
|
||||
lib = libext_mpiext_cr.la
|
||||
lib_sources = $(sources)
|
||||
|
||||
extcomponentdir = $(pkglibdir)
|
||||
|
||||
noinst_LTLIBRARIES = $(lib)
|
||||
libext_mpiext_cr_la_SOURCES = $(lib_sources)
|
||||
libext_mpiext_cr_la_LDFLAGS = -module -avoid-version
|
||||
|
||||
ompidir = $(includedir)/openmpi/ompi/mpiext/cr
|
||||
ompi_HEADERS = \
|
||||
$(headers)
|
88
ompi/mpiext/cr/c/checkpoint.c
Normal file
@@ -0,0 +1,88 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "ompi/info/info.h"
#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "orte/mca/snapc/snapc.h"

#include "ompi/mpiext/cr/mpiext_cr_c.h"

static const char FUNC_NAME[] = "OMPI_CR_Checkpoint";
#define HANDLE_SIZE_MAX 256

int OMPI_CR_Checkpoint(char **handle, int *seq, MPI_Info *info)
{
    int ret = MPI_SUCCESS;
    MPI_Comm comm = MPI_COMM_WORLD;
    orte_snapc_base_request_op_t *datum = NULL;
    int state = 0;
    int my_rank;

    /* argument checking */
    if (MPI_PARAM_CHECK) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    /*
     * Setup the data structure for the operation
     */
    datum = OBJ_NEW(orte_snapc_base_request_op_t);
    datum->event = ORTE_SNAPC_OP_CHECKPOINT;
    datum->is_active = true;

    MPI_Comm_rank(comm, &my_rank);
    if( 0 == my_rank ) {
        datum->leader = ORTE_PROC_MY_NAME->vpid;
    } else {
        datum->leader = -1; /* Unknown from non-root ranks */
    }

    /*
     * All processes must make this call before it can start
     */
    MPI_Barrier(comm);

    /*
     * Leader sends the request
     */
    OPAL_CR_ENTER_LIBRARY();
    ret = orte_snapc.request_op(datum);
    if( OMPI_SUCCESS != ret ) {
        OBJ_RELEASE(datum);
        OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
                               FUNC_NAME);
    }
    OPAL_CR_EXIT_LIBRARY();

    /*
     * Leader then sends out the commit message
     */
    if( datum->leader == (int)ORTE_PROC_MY_NAME->vpid ) {
        *handle = strdup(datum->global_handle);
        *seq = datum->seq_num;
        state = 0;
    } else {
        *handle = (char*)malloc(sizeof(char)*HANDLE_SIZE_MAX);
    }

    MPI_Bcast(&state, 1, MPI_INT, 0, comm);
    MPI_Bcast(seq, 1, MPI_INT, 0, comm);
    MPI_Bcast(*handle, HANDLE_SIZE_MAX, MPI_CHAR, 0, comm);

    datum->is_active = false;
    OBJ_RELEASE(datum);

    return ret;
}
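Review note on the handle broadcast in `OMPI_CR_Checkpoint`: rank 0 broadcasts `HANDLE_SIZE_MAX` bytes from a buffer returned by `strdup()`, which is only as long as the handle string itself. One possible way to stage the handle is a fixed-size, zero-padded buffer, sketched below. `make_fixed_handle` is a hypothetical helper and not part of this commit; the sketch makes no MPI calls so it can be compiled on its own.

```c
#include <stdlib.h>
#include <string.h>

#define HANDLE_SIZE_MAX 256

/*
 * Hypothetical helper (not in the commit): return a heap buffer of exactly
 * HANDLE_SIZE_MAX bytes holding a NUL-terminated, zero-padded copy of src.
 * A NULL src yields an empty string, which is what a non-root rank could
 * allocate before receiving the broadcast. Returns NULL if allocation fails.
 */
char *make_fixed_handle(const char *src)
{
    char *buf = calloc(HANDLE_SIZE_MAX, sizeof(char));
    if (NULL == buf) {
        return NULL;
    }
    if (NULL != src) {
        /* Copy at most HANDLE_SIZE_MAX - 1 bytes; calloc already
         * zeroed the final byte, so the result stays terminated. */
        strncpy(buf, src, HANDLE_SIZE_MAX - 1);
    }
    return buf;
}
```

With a buffer like this on every rank, a broadcast of `HANDLE_SIZE_MAX` characters reads and writes only memory the caller owns. The caller frees the buffer with `free()`.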
39
ompi/mpiext/cr/c/inc_register_callback.c
Normal file
@@ -0,0 +1,39 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "opal/runtime/opal_cr.h"
#include "ompi/mpiext/cr/mpiext_cr_c.h"

#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "ompi/errhandler/errhandler.h"

static const char FUNC_NAME[] = "OMPI_CR_INC_register_callback";

int OMPI_CR_INC_register_callback(OMPI_CR_INC_callback_event_t event,
                                  OMPI_CR_INC_callback_function function,
                                  OMPI_CR_INC_callback_function *prev_function)
{
    int rc;

    if ( MPI_PARAM_CHECK ) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    OPAL_CR_ENTER_LIBRARY();

    rc = opal_cr_user_inc_register_callback(event, function, prev_function);

    OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME);
}
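`OMPI_CR_INC_register_callback` hands the previously registered function back through `prev_function`, which lets a caller chain or later restore a handler. The register-and-return-previous pattern can be sketched on its own as follows. The event names, the table, and `demo_register_callback` are hypothetical stand-ins for illustration, not the `opal_cr` implementation.

```c
#include <stddef.h>

/* Hypothetical event IDs, for illustration only. */
typedef enum {
    DEMO_EVENT_PRE_CKPT = 0,
    DEMO_EVENT_POST_CKPT,
    DEMO_EVENT_MAX
} demo_event_t;

typedef int (*demo_callback_fn)(demo_event_t event, int state);

/* One slot per event; NULL means "nothing registered". */
static demo_callback_fn demo_callbacks[DEMO_EVENT_MAX];

/*
 * Install `function` for `event` and report the callback it replaces
 * through `prev_function` (which may be NULL if the caller does not care).
 * Returns 0 on success, -1 if `event` is out of range.
 */
int demo_register_callback(demo_event_t event,
                           demo_callback_fn function,
                           demo_callback_fn *prev_function)
{
    int idx = (int)event;

    if (idx < 0 || idx >= (int)DEMO_EVENT_MAX) {
        return -1;
    }
    if (NULL != prev_function) {
        *prev_function = demo_callbacks[idx];
    }
    demo_callbacks[idx] = function;
    return 0;
}

/* Two trivial callbacks used to exercise the registry. */
int demo_first(demo_event_t event, int state)  { (void)event; return state + 1; }
int demo_second(demo_event_t event, int state) { (void)event; return state + 2; }
```

Registering `demo_first` and then `demo_second` for the same event returns `NULL` and then `demo_first` through `prev_function`, so the second registrant can call the first from inside its own handler if it wants chaining.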
120
ompi/mpiext/cr/c/migrate.c
Normal file
@@ -0,0 +1,120 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "ompi/info/info.h"
#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "orte/mca/snapc/snapc.h"

#include "ompi/mpiext/cr/mpiext_cr_c.h"

static const char FUNC_NAME[] = "OMPI_CR_Migrate";

int OMPI_CR_Migrate(MPI_Comm comm, char *hostname, int rank, MPI_Info *info)
{
    int ret = MPI_SUCCESS;
    orte_snapc_base_request_op_t *datum = NULL;
    int my_rank, my_size, i;
    char loc_hostname[MPI_MAX_PROCESSOR_NAME];
    int my_vpid;
    int info_flag;
    char info_value[6];
    int my_off_node = (int)false;

    /* argument checking */
    if (MPI_PARAM_CHECK) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    /*
     * Setup the data structure for the operation
     */
    datum = OBJ_NEW(orte_snapc_base_request_op_t);
    datum->event = ORTE_SNAPC_OP_MIGRATE;
    datum->is_active = true;

    MPI_Comm_rank(comm, &my_rank);
    MPI_Comm_size(comm, &my_size);
    if( 0 == my_rank ) {
        datum->leader = ORTE_PROC_MY_NAME->vpid;
    } else {
        datum->leader = -1; /* Unknown from non-root ranks */
    }

    /*
     * Gather all preferences to the root
     */
    if( NULL == hostname ) {
        loc_hostname[0] = '\0';
    } else {
        strncpy(loc_hostname, hostname, strlen(hostname));
        loc_hostname[strlen(hostname)] = '\0';
    }
    my_vpid = (int) ORTE_PROC_MY_NAME->vpid;

    if( 0 == my_rank ) {
        datum->mig_num = my_size;
        datum->mig_vpids = malloc(sizeof(int) * my_size);
        datum->mig_host_pref = malloc(sizeof(char) * my_size * MPI_MAX_PROCESSOR_NAME);
        datum->mig_vpid_pref = malloc(sizeof(int) * my_size);
        datum->mig_off_node = malloc(sizeof(int) * my_size);

        for( i = 0; i < my_size; ++i ) {
            (datum->mig_vpids)[i] = 0;
            (datum->mig_host_pref)[i][0] = '\0';
            (datum->mig_vpid_pref)[i] = 0;
            (datum->mig_off_node)[i] = (int)false;
        }
    }

    my_off_node = (int)false;
    if( NULL != info ) {
        MPI_Info_get(*info, "CR_OFF_NODE", 5, info_value, &info_flag);
        if( info_flag ) {
            if( 0 == strncmp(info_value, "true", strlen("true")) ) {
                my_off_node = (int)true;
            }
        }
    }

    MPI_Gather(&my_vpid, 1, MPI_INT,
               (datum->mig_vpids), 1, MPI_INT, 0, comm);
    MPI_Gather(loc_hostname, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
               (datum->mig_host_pref), MPI_MAX_PROCESSOR_NAME, MPI_CHAR, 0, comm);
    MPI_Gather(&my_vpid, 1, MPI_INT,
               (datum->mig_vpid_pref), 1, MPI_INT, 0, comm);
    MPI_Gather(&my_off_node, 1, MPI_INT,
               (datum->mig_off_node), 1, MPI_INT, 0, comm);

    /*
     * Leader sends the request
     */
    OPAL_CR_ENTER_LIBRARY();
    ret = orte_snapc.request_op(datum);
    if( OMPI_SUCCESS != ret ) {
        OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
                               FUNC_NAME);
    }
    OPAL_CR_EXIT_LIBRARY();

    datum->is_active = false;
    OBJ_RELEASE(datum);

    /*
     * All processes must sync before leaving
     */
    MPI_Barrier(comm);

    return ret;
}
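Review note on the hostname copy in `OMPI_CR_Migrate`: `strncpy` is given `strlen(hostname)` as its bound, so the copy is limited by the source length rather than by the size of `loc_hostname`. A copy bounded by the destination could look like the sketch below. `copy_hostname_bounded` and `DEMO_MAX_NAME` are hypothetical; `DEMO_MAX_NAME` stands in for `MPI_MAX_PROCESSOR_NAME` so the example builds without MPI.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for MPI_MAX_PROCESSOR_NAME; kept small so truncation is easy to see. */
#define DEMO_MAX_NAME 16

/*
 * Copy src into dst, which has room for cap bytes, and always leave dst
 * NUL-terminated. A NULL src produces an empty string, matching the
 * "no preference" case in the function above.
 *
 * Returns 0 if the whole string fit, 1 if it was truncated,
 * and -1 if dst is NULL or cap is 0.
 */
int copy_hostname_bounded(char *dst, size_t cap, const char *src)
{
    size_t len;

    if (NULL == dst || 0 == cap) {
        return -1;
    }
    if (NULL == src) {
        dst[0] = '\0';
        return 0;
    }

    len = strlen(src);
    if (len >= cap) {
        /* Keep the first cap - 1 bytes and terminate by hand. */
        memcpy(dst, src, cap - 1);
        dst[cap - 1] = '\0';
        return 1;
    }

    /* len + 1 also copies the terminating NUL. */
    memcpy(dst, src, len + 1);
    return 0;
}
```

Returning a distinct value on truncation lets the caller decide whether a shortened hostname is acceptable or should be reported as an error.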
69
ompi/mpiext/cr/c/quiesce_checkpoint.c
Normal file
@@ -0,0 +1,69 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "ompi/info/info.h"
#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "orte/mca/snapc/snapc.h"

#include "ompi/mpiext/cr/mpiext_cr_c.h"

static const char FUNC_NAME[] = "OMPI_CR_Quiesce_checkpoint";

int OMPI_CR_Quiesce_checkpoint(MPI_Comm commP, char **handle, int *seq, MPI_Info *info)
{
    int ret = MPI_SUCCESS;
    MPI_Comm comm = MPI_COMM_WORLD; /* Currently ignore provided comm */
    orte_snapc_base_request_op_t *datum = NULL;
    int my_rank;

    /* argument checking */
    if (MPI_PARAM_CHECK) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    /*
     * Setup the data structure for the operation
     */
    datum = OBJ_NEW(orte_snapc_base_request_op_t);
    datum->event = ORTE_SNAPC_OP_QUIESCE_CHECKPOINT;
    datum->is_active = true;

    MPI_Comm_rank(comm, &my_rank);
    if( 0 == my_rank ) {
        datum->leader = ORTE_PROC_MY_NAME->vpid;
    } else {
        datum->leader = -1; /* Unknown from non-root ranks */
    }

    /*
     * Since we are quiescent, then this is a local operation
     */
    OPAL_CR_ENTER_LIBRARY();
    ret = orte_snapc.request_op(datum);
    /*ret = ompi_crcp_base_quiesce_start(info);*/
    if( OMPI_SUCCESS != ret ) {
        OMPI_ERRHANDLER_INVOKE(MPI_COMM_WORLD, MPI_ERR_OTHER,
                               FUNC_NAME);
    }
    OPAL_CR_EXIT_LIBRARY();

    *handle = strdup(datum->global_handle);
    *seq = datum->seq_num;

    datum->is_active = false;
    OBJ_RELEASE(datum);

    return ret;
}
74
ompi/mpiext/cr/c/quiesce_end.c
Normal file
@@ -0,0 +1,74 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "ompi/info/info.h"
#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "orte/mca/snapc/snapc.h"

#include "ompi/mpiext/cr/mpiext_cr_c.h"

static const char FUNC_NAME[] = "OMPI_CR_Quiesce_end";

int OMPI_CR_Quiesce_end(MPI_Comm commP, MPI_Info *info)
{
    int ret = MPI_SUCCESS;
    MPI_Comm comm = MPI_COMM_WORLD; /* Currently ignore provided comm */
    orte_snapc_base_request_op_t *datum = NULL;
    int my_rank;

    /* argument checking */
    if (MPI_PARAM_CHECK) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    /*
     * Setup the data structure for the operation
     */
    datum = OBJ_NEW(orte_snapc_base_request_op_t);
    datum->event = ORTE_SNAPC_OP_QUIESCE_END;
    datum->is_active = true;

    MPI_Comm_rank(comm, &my_rank);
    if( 0 == my_rank ) {
        datum->leader = ORTE_PROC_MY_NAME->vpid;
    } else {
        datum->leader = -1; /* Unknown from non-root ranks */
    }

    /*
     * Leader sends the request
     */
    OPAL_CR_ENTER_LIBRARY();
    ret = orte_snapc.request_op(datum);
    /*ret = ompi_crcp_base_quiesce_end(info);*/
    if( OMPI_SUCCESS != ret ) {
        OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
                               FUNC_NAME);
    }
    OPAL_CR_EXIT_LIBRARY();

    /*
     * All processes must make this call before it can complete
     */
    MPI_Barrier(comm);

    /*
     * (Old) info logic
     */
    /*cur_datum.epoch = -1;*/

    return ret;
}
210
ompi/mpiext/cr/c/quiesce_start.c
Normal file
@@ -0,0 +1,210 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "ompi/info/info.h"
#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "orte/mca/snapc/snapc.h"

#include "ompi/mpiext/cr/mpiext_cr_c.h"

static const char FUNC_NAME[] = "OMPI_CR_Quiesce_start";

int OMPI_CR_Quiesce_start(MPI_Comm commP, MPI_Info *info)
{
    int ret = MPI_SUCCESS;
    MPI_Comm comm = MPI_COMM_WORLD; /* Currently ignore provided comm */
    orte_snapc_base_request_op_t *datum = NULL;
    int my_rank;

    /* argument checking */
    if (MPI_PARAM_CHECK) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    /*
     * Setup the data structure for the operation
     */
    datum = OBJ_NEW(orte_snapc_base_request_op_t);
    datum->event = ORTE_SNAPC_OP_QUIESCE_START;
    datum->is_active = true;

    MPI_Comm_rank(comm, &my_rank);
    if( 0 == my_rank ) {
        datum->leader = ORTE_PROC_MY_NAME->vpid;
    } else {
        datum->leader = -1; /* Unknown from non-root ranks */
    }

    /*
     * All processes must make this call before it can start
     */
    MPI_Barrier(comm);

    /*
     * Leader sends the request
     */
    OPAL_CR_ENTER_LIBRARY();
    ret = orte_snapc.request_op(datum);
    /*ret = ompi_crcp_base_quiesce_start(info);*/
    if( OMPI_SUCCESS != ret ) {
        OBJ_RELEASE(datum);
        OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
                               FUNC_NAME);
    }

    OPAL_CR_EXIT_LIBRARY();

    datum->is_active = false;
    OBJ_RELEASE(datum);

    /*
     * (Old) info logic
     */
    /*ompi_info_set((ompi_info_t*)*info, "target", cur_datum.target_dir);*/

    return ret;
}

/*****************
 * Local Functions
 ******************/
#if 0
/* Info keys:
 *
 * - crs:
 *   none    = (Default) No CRS Service
 *   default = Whatever CRS service MPI chooses
 *   blcr    = BLCR
 *   self    = app level callbacks
 *
 * - cmdline:
 *   Command line to restart the process with.
 *   If empty, the user must manually enter it
 *
 * - target:
 *   Absolute path to the target directory.
 *
 * - handle:
 *   first = Earliest checkpoint directory available
 *   last  = Most recent checkpoint directory available
 *   [global:local] = handle provided by the MPI library
 *
 * - restarting:
 *   0 = not restarting
 *   1 = restarting
 *
 * - checkpointing:
 *   0 = No need to prepare for checkpointing
 *   1 = MPI should prepare for checkpointing
 *
 * - inflight:
 *   default = message
 *   message = Drain inflight messages at the message level
 *   network = Drain inflight messages at the network level (if possible)
 *
 * - user_space_mem:
 *   0 = Memory does not need to be managed
 *   1 = Memory must be in user space (i.e., not on network card)
 *
 */
static int extract_info_into_datum(ompi_info_t *info, orte_snapc_base_quiesce_t *datum)
{
    int info_flag = false;
    int max_crs_len = 32;
    bool info_bool = false;
    char *info_char = NULL;

    info_char = (char *) malloc(sizeof(char) * (OPAL_PATH_MAX+1));

    /*
     * Key: crs
     */
    ompi_info_get(info, "crs", max_crs_len, info_char, &info_flag);
    if( info_flag) {
        datum->crs_name = strdup(info_char);
    }

    /*
     * Key: cmdline
     */
    ompi_info_get(info, "cmdline", OPAL_PATH_MAX, info_char, &info_flag);
    if( info_flag) {
        datum->cmdline = strdup(info_char);
    }

    /*
     * Key: handle
     */
    ompi_info_get(info, "handle", OPAL_PATH_MAX, info_char, &info_flag);
    if( info_flag) {
        datum->handle = strdup(info_char);
    }

    /*
     * Key: target
     */
    ompi_info_get(info, "target", OPAL_PATH_MAX, info_char, &info_flag);
    if( info_flag) {
        datum->target_dir = strdup(info_char);
    }

    /*
     * Key: restarting
     */
    ompi_info_get_bool(info, "restarting", &info_bool, &info_flag);
    if( info_flag ) {
        datum->restarting = info_bool;
    } else {
        datum->restarting = false;
    }

    /*
     * Key: checkpointing
     */
    ompi_info_get_bool(info, "checkpointing", &info_bool, &info_flag);
    if( info_flag ) {
        datum->checkpointing = info_bool;
    } else {
        datum->checkpointing = false;
    }

    /*
     * Display all values
     */
    OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle,
                         "crcp:bkmrk: %s extract_info: Info('crs' = '%s')",
                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                         (NULL == datum->crs_name ? "Default (none)" : datum->crs_name)));
    OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle,
                         "crcp:bkmrk: %s extract_info: Info('cmdline' = '%s')",
                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                         (NULL == datum->cmdline ? "Default ()" : datum->cmdline)));
    OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle,
                         "crcp:bkmrk: %s extract_info: Info('checkpointing' = '%c')",
                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                         (datum->checkpointing ? 'T' : 'F')));
    OPAL_OUTPUT_VERBOSE((3, mca_crcp_bkmrk_component.super.output_handle,
                         "crcp:bkmrk: %s extract_info: Info('restarting' = '%c')",
                         ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
                         (datum->restarting ? 'T' : 'F')));

    if( NULL != info_char ) {
        free(info_char);
        info_char = NULL;
    }

    return ORTE_SUCCESS;
}
#endif
66
ompi/mpiext/cr/c/restart.c
Normal file
@@ -0,0 +1,66 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "ompi/info/info.h"
#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "orte/mca/snapc/snapc.h"

#include "ompi/mpiext/cr/mpiext_cr_c.h"

static const char FUNC_NAME[] = "OMPI_CR_Restart";

int OMPI_CR_Restart(char *handle, int seq, MPI_Info *info)
{
    int ret = MPI_SUCCESS;
    MPI_Comm comm = MPI_COMM_WORLD;
    orte_snapc_base_request_op_t *datum = NULL;

    /* argument checking */
    if (MPI_PARAM_CHECK) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    /*
     * Setup the data structure for the operation
     */
    datum = OBJ_NEW(orte_snapc_base_request_op_t);
    datum->event = ORTE_SNAPC_OP_RESTART;
    datum->is_active = true;

    /*
     * Restart is not collective, so the caller is the leader
     */
    datum->leader = ORTE_PROC_MY_NAME->vpid;
    datum->seq_num = seq;
    datum->global_handle = strdup(handle);

    /*
     * Leader sends the request
     */
    OPAL_CR_ENTER_LIBRARY();
    ret = orte_snapc.request_op(datum);
    if( OMPI_SUCCESS != ret ) {
        OMPI_ERRHANDLER_INVOKE(comm, MPI_ERR_OTHER,
                               FUNC_NAME);
    }
    OPAL_CR_EXIT_LIBRARY();

    datum->is_active = false;
    OBJ_RELEASE(datum);

    /********** If successful, should never reach this point (JJH) ******/

    return ret;
}
39
ompi/mpiext/cr/c/self_register_checkpoint.c
Normal file
@@ -0,0 +1,39 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "opal/runtime/opal_cr.h"
#include "ompi/mpiext/cr/mpiext_cr_c.h"

#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "ompi/errhandler/errhandler.h"
#include "opal/mca/crs/crs.h"
#include "opal/mca/crs/base/base.h"

static const char FUNC_NAME[] = "OMPI_CR_self_register_checkpoint_callback";

int OMPI_CR_self_register_checkpoint_callback(OMPI_CR_self_checkpoint_fn function)
{
    int rc;

    if ( MPI_PARAM_CHECK ) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    OPAL_CR_ENTER_LIBRARY();

    rc = opal_crs_base_self_register_checkpoint_callback(function);

    OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME);
}
39
ompi/mpiext/cr/c/self_register_continue.c
Normal file
@@ -0,0 +1,39 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "opal/runtime/opal_cr.h"
#include "ompi/mpiext/cr/mpiext_cr_c.h"

#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "ompi/errhandler/errhandler.h"
#include "opal/mca/crs/crs.h"
#include "opal/mca/crs/base/base.h"

static const char FUNC_NAME[] = "OMPI_CR_self_register_continue_callback";

int OMPI_CR_self_register_continue_callback(OMPI_CR_self_continue_fn function)
{
    int rc;

    if ( MPI_PARAM_CHECK ) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    OPAL_CR_ENTER_LIBRARY();

    rc = opal_crs_base_self_register_continue_callback(function);

    OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME);
}
39
ompi/mpiext/cr/c/self_register_restart.c
Normal file
@@ -0,0 +1,39 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include "ompi_config.h"
#include <stdio.h>

#include "ompi/mpi/c/bindings.h"
#include "opal/runtime/opal_cr.h"
#include "ompi/mpiext/cr/mpiext_cr_c.h"

#include "ompi/runtime/params.h"
#include "ompi/communicator/communicator.h"
#include "ompi/errhandler/errhandler.h"
#include "opal/mca/crs/crs.h"
#include "opal/mca/crs/base/base.h"

static const char FUNC_NAME[] = "OMPI_CR_self_register_restart_callback";

int OMPI_CR_self_register_restart_callback(OMPI_CR_self_restart_fn function)
{
    int rc;

    if ( MPI_PARAM_CHECK ) {
        OMPI_ERR_INIT_FINALIZE(FUNC_NAME);
    }

    OPAL_CR_ENTER_LIBRARY();

    rc = opal_crs_base_self_register_restart_callback(function);

    OMPI_ERRHANDLER_RETURN(rc, MPI_COMM_WORLD, rc, FUNC_NAME);
}
19
ompi/mpiext/cr/configure.m4
Normal file
@@ -0,0 +1,19 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2010 The Trustees of Indiana University.
# All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

# EXT_mpiext_cr_CONFIG([action-if-found], [action-if-not-found])
# -----------------------------------------------------------
AC_DEFUN([EXT_mpiext_cr_CONFIG],[
    # If we don't want FT, don't compile this component
    AS_IF([test "$opal_want_ft_cr" = "1"],
        [$1],
        [$2])
])dnl
12
ompi/mpiext/cr/configure.params
Normal file
@@ -0,0 +1,12 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2010 The Trustees of Indiana University.
# All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

PARAM_CONFIG_FILES="Makefile"
82
ompi/mpiext/cr/mpiext_cr_c.h
Normal file
@@ -0,0 +1,82 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 * All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 *
 */
#include "opal/runtime/opal_cr.h"

/********************************
 * C/R Interfaces
 ********************************/
/*
 * Request a checkpoint
 */
OMPI_DECLSPEC int OMPI_CR_Checkpoint(char **handle, int *seq, MPI_Info *info);

/*
 * Request a restart
 */
OMPI_DECLSPEC int OMPI_CR_Restart(char *handle, int seq, MPI_Info *info);


/********************************
 * Migration Interface
 ********************************/
/*
 * Request a migration
 */
OMPI_DECLSPEC int OMPI_CR_Migrate(MPI_Comm comm, char *hostname, int rank, MPI_Info *info);


/********************************
 * INC Interfaces
 ********************************/
typedef opal_cr_user_inc_callback_event_t OMPI_CR_INC_callback_event_t;

typedef opal_cr_user_inc_callback_state_t OMPI_CR_INC_callback_state_t;

typedef int (*OMPI_CR_INC_callback_function)(OMPI_CR_INC_callback_event_t event,
                                             OMPI_CR_INC_callback_state_t state);

OMPI_DECLSPEC int OMPI_CR_INC_register_callback(OMPI_CR_INC_callback_event_t event,
                                                OMPI_CR_INC_callback_function function,
                                                OMPI_CR_INC_callback_function *prev_function);


/********************************
 * SELF CRS Application Interfaces
 ********************************/
typedef int (*OMPI_CR_self_checkpoint_fn)(char **restart_cmd);
typedef int (*OMPI_CR_self_restart_fn)(void);
typedef int (*OMPI_CR_self_continue_fn)(void);

OMPI_DECLSPEC int OMPI_CR_self_register_checkpoint_callback(OMPI_CR_self_checkpoint_fn function);
OMPI_DECLSPEC int OMPI_CR_self_register_restart_callback(OMPI_CR_self_restart_fn function);
OMPI_DECLSPEC int OMPI_CR_self_register_continue_callback(OMPI_CR_self_continue_fn function);


/********************************
 * Quiescence Interfaces
 ********************************/
/*
 * Start the Quiescent region.
 * Note: 'comm' required to be MPI_COMM_WORLD
 */
OMPI_DECLSPEC int OMPI_CR_Quiesce_start(MPI_Comm comm, MPI_Info *info);

/*
 * Request a checkpoint during a quiescent region
 * Note: 'comm' required to be MPI_COMM_WORLD
 */
OMPI_DECLSPEC int OMPI_CR_Quiesce_checkpoint(MPI_Comm comm, char **handle, int *seq, MPI_Info *info);

/*
 * End the Quiescent Region
 * Note: 'comm' required to be MPI_COMM_WORLD
 */
OMPI_DECLSPEC int OMPI_CR_Quiesce_end(MPI_Comm comm, MPI_Info *info);
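The comments on the three quiescence declarations imply a call order: `OMPI_CR_Quiesce_start` opens a region, `OMPI_CR_Quiesce_checkpoint` may be called while the region is open, and `OMPI_CR_Quiesce_end` closes it. That order can be sketched as a small state check. The `demo_` names are hypothetical, and the sketch assumes regions do not nest, which the header does not state. It makes no Open MPI calls, so it illustrates the expected sequence only, not how the library enforces it.

```c
/* Hypothetical region state: either outside or inside a quiescent region. */
typedef enum { DEMO_Q_IDLE, DEMO_Q_ACTIVE } demo_quiesce_state_t;

/* The three calls an application would make, in the order described above. */
typedef enum { DEMO_Q_START, DEMO_Q_CHECKPOINT, DEMO_Q_END } demo_quiesce_call_t;

/*
 * Check whether `call` is valid in the current state.
 * On success returns 0 and updates *state; on an out-of-order call
 * returns -1 and leaves *state unchanged.
 */
int demo_quiesce_step(demo_quiesce_state_t *state, demo_quiesce_call_t call)
{
    switch (call) {
    case DEMO_Q_START:
        /* Assumption: a region cannot be opened twice. */
        if (DEMO_Q_IDLE != *state) {
            return -1;
        }
        *state = DEMO_Q_ACTIVE;
        return 0;
    case DEMO_Q_CHECKPOINT:
        /* Checkpoints are only requested inside an open region. */
        return (DEMO_Q_ACTIVE == *state) ? 0 : -1;
    case DEMO_Q_END:
        if (DEMO_Q_ACTIVE != *state) {
            return -1;
        }
        *state = DEMO_Q_IDLE;
        return 0;
    }
    return -1;
}
```

Under these assumptions the sequence start, checkpoint, checkpoint, end is accepted, while a checkpoint or end outside a region is rejected.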
@@ -1,6 +1,6 @@
/* -*- Mode: C; c-basic-offset:4 ; -*- */
/*
 * Copyright (c) 2004-2008 The Trustees of Indiana University and Indiana
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 * University Research and Technology
 * Corporation. All rights reserved.
 * Copyright (c) 2004-2007 The University of Tennessee and The University
@@ -43,6 +43,7 @@
#include "opal/util/output.h"
#include "opal/mca/crs/crs.h"
#include "opal/mca/crs/base/base.h"
#include "opal/mca/installdirs/installdirs.h"
#include "opal/runtime/opal_cr.h"

#include "orte/mca/snapc/snapc.h"
@@ -56,6 +57,18 @@
#include "ompi/mca/crcp/base/base.h"
#include "ompi/communicator/communicator.h"
#include "ompi/runtime/ompi_cr.h"
#if OPAL_ENABLE_CRDEBUG == 1
#include "orte/runtime/orte_globals.h"
#include "ompi/debuggers/debuggers.h"
#endif

#if OPAL_ENABLE_CRDEBUG == 1
OMPI_DECLSPEC int MPIR_checkpointable = 0;
OMPI_DECLSPEC char * MPIR_controller_hostname = NULL;
OMPI_DECLSPEC char * MPIR_checkpoint_command = NULL;
OMPI_DECLSPEC char * MPIR_restart_command = NULL;
OMPI_DECLSPEC char * MPIR_checkpoint_listing_command = NULL;
#endif

/*************
 * Local functions
@@ -68,8 +81,6 @@ static int ompi_cr_coord_post_ckpt(void);
static int ompi_cr_coord_post_restart(void);
static int ompi_cr_coord_post_continue(void);

bool ompi_cr_continue_like_restart = false;

/*************
 * Local vars
 *************/
@@ -157,15 +168,59 @@ int ompi_cr_init(void)
        ompi_cr_output = opal_cr_output;
    }

    /* Typically this is not needed. Individual BTLs will set this as needed */
    ompi_cr_continue_like_restart = false;

    opal_output_verbose(10, ompi_cr_output,
                        "ompi_cr: init: ompi_cr_init()");

    /* Register the OMPI interlevel coordination callback */
    opal_cr_reg_coord_callback(ompi_cr_coord, &prev_coord_callback);

#if OPAL_ENABLE_CRDEBUG == 1
    /* Check for C/R enabled debugging */
    if( MPIR_debug_with_checkpoint ) {
        char *uri = NULL;
        char *sep = NULL;
        char *hostname = NULL;

        /* Mark as debuggable with C/R */
        MPIR_checkpointable = 1;

        /* Set the checkpoint and restart commands */
        /* Add the full path to the binary */
        asprintf(&MPIR_checkpoint_command,
                 "%s/ompi-checkpoint --crdebug --hnp-jobid %u",
                 opal_install_dirs.bindir,
                 ORTE_PROC_MY_HNP->jobid);
        asprintf(&MPIR_restart_command,
                 "%s/ompi-restart --crdebug ",
                 opal_install_dirs.bindir);
        asprintf(&MPIR_checkpoint_listing_command,
                 "%s/ompi-checkpoint -l --crdebug ",
                 opal_install_dirs.bindir);

        /* Set contact information for HNP */
        uri = strdup(orte_process_info.my_hnp_uri);
        hostname = strchr(uri, ';') + 1;
        sep = strchr(hostname, ';');
        if (sep) {
            *sep = 0;
        }
        if (strncmp(hostname, "tcp://", 6) == 0) {
            hostname += 6;
            sep = strchr(hostname, ':');
            *sep = 0;
            MPIR_controller_hostname = strdup(hostname);
        } else {
            MPIR_controller_hostname = strdup("localhost");
        }

        /* Cleanup */
        if( NULL != uri ) {
            free(uri);
            uri = NULL;
        }
    }
#endif

    return OMPI_SUCCESS;
}

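Review note on the HNP contact parsing in `ompi_cr_init`: the results of `strchr` are used without a NULL check in two places (`strchr(uri, ';') + 1` and `*sep = 0` after looking for `':'`). A defensive version of the same extraction could look like the sketch below. It assumes, based only on the parsing in this hunk, that the URI has the shape `<name>;tcp://<host>:<port>[;...]`; `demo_extract_hnp_hostname` and `demo_dup_n` are hypothetical names, not ORTE functions.

```c
#include <stdlib.h>
#include <string.h>

/* Duplicate the first n bytes of s into a fresh NUL-terminated string. */
static char *demo_dup_n(const char *s, size_t n)
{
    char *out = malloc(n + 1);
    if (NULL != out) {
        memcpy(out, s, n);
        out[n] = '\0';
    }
    return out;
}

/*
 * Extract the host part of the first contact entry in an HNP URI.
 * Falls back to "localhost" whenever the expected pieces are missing,
 * mirroring the else-branch in the hunk above.
 * The caller frees the result; NULL is returned only if malloc fails.
 */
char *demo_extract_hnp_hostname(const char *uri)
{
    static const char fallback[] = "localhost";
    const char *start;
    size_t len;

    if (NULL == uri) {
        return demo_dup_n(fallback, sizeof(fallback) - 1);
    }

    /* Skip the leading "<name>;" field. */
    start = strchr(uri, ';');
    if (NULL == start) {
        return demo_dup_n(fallback, sizeof(fallback) - 1);
    }
    ++start;

    if (0 != strncmp(start, "tcp://", 6)) {
        return demo_dup_n(fallback, sizeof(fallback) - 1);
    }
    start += 6;

    /* The host ends at the port separator or at the next contact entry. */
    len = strcspn(start, ":;");
    if (0 == len) {
        return demo_dup_n(fallback, sizeof(fallback) - 1);
    }
    return demo_dup_n(start, len);
}
```

Because the function works on a `const` input and copies only the host, it also avoids the temporary `strdup` of the whole URI.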
||||
@ -196,9 +251,6 @@ int ompi_cr_coord(int state)
|
||||
* take action given the state.
|
||||
*/
|
||||
if(OPAL_CRS_CHECKPOINT == state) {
|
||||
/* Default: use the fast way */
|
||||
ompi_cr_continue_like_restart = false;
|
||||
|
||||
/* Do Checkpoint Phase work */
|
||||
ret = ompi_cr_coord_pre_ckpt();
|
||||
if( ret == OMPI_EXISTS) {
|
||||
@ -245,10 +297,30 @@ int ompi_cr_coord(int state)
|
||||
else if (OPAL_CRS_CONTINUE == state ) {
|
||||
/* Do Continue Phase work */
|
||||
ompi_cr_coord_post_continue();
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
/*
|
||||
* If C/R enabled debugging,
|
||||
* wait here for debugger to attach
|
||||
*/
|
||||
if( MPIR_debug_with_checkpoint ) {
|
||||
MPIR_checkpoint_debugger_breakpoint();
|
||||
}
|
||||
#endif
|
||||
}
|
||||
else if (OPAL_CRS_RESTART == state ) {
|
||||
/* Do Restart Phase work */
|
||||
ompi_cr_coord_post_restart();
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
/*
|
||||
* If C/R enabled debugging,
|
||||
* wait here for debugger to attach
|
||||
*/
|
||||
if( MPIR_debug_with_checkpoint ) {
|
||||
MPIR_checkpoint_debugger_breakpoint();
|
||||
}
|
||||
#endif
|
||||
}
|
||||
else if (OPAL_CRS_TERM == state ) {
|
||||
/* Do Continue Phase work in prep to terminate the application */
|
||||
@@ -330,7 +402,7 @@ static int ompi_cr_coord_pre_continue(void) {
|
||||
opal_output_verbose(10, ompi_cr_output,
|
||||
"ompi_cr: coord_pre_continue: ompi_cr_coord_pre_continue()");
|
||||
|
||||
if( ompi_cr_continue_like_restart ) {
|
||||
if( orte_cr_continue_like_restart ) {
|
||||
/* Mimic ompi_cr_coord_pre_restart(); */
|
||||
if( ORTE_SUCCESS != (ret = mca_pml.pml_ft_event(OPAL_CRS_CONTINUE))) {
|
||||
exit_status = ret;
|
||||
|
@@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
* University Research and Technology
|
||||
* Corporation. All rights reserved.
|
||||
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||
@@ -26,6 +26,7 @@
|
||||
#define OMPI_CR_H
|
||||
|
||||
#include "ompi_config.h"
|
||||
#include "orte/runtime/orte_cr.h"
|
||||
|
||||
BEGIN_C_DECLS
|
||||
|
||||
@@ -49,11 +50,13 @@ BEGIN_C_DECLS
|
||||
*/
|
||||
OMPI_DECLSPEC extern int ompi_cr_output;
|
||||
|
||||
/*
|
||||
* If one of the BTLs that shutdown require a full, clean rebuild of the
|
||||
* point-to-point stack on 'continue' as well as 'restart'.
|
||||
*/
|
||||
OPAL_DECLSPEC extern bool ompi_cr_continue_like_restart;
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
OMPI_DECLSPEC extern int MPIR_checkpointable;
|
||||
OMPI_DECLSPEC extern char * MPIR_controller_hostname;
|
||||
OMPI_DECLSPEC extern char * MPIR_checkpoint_command;
|
||||
OMPI_DECLSPEC extern char * MPIR_restart_command;
|
||||
OMPI_DECLSPEC extern char * MPIR_checkpoint_listing_command;
|
||||
#endif
|
||||
|
||||
END_C_DECLS
|
||||
|
||||
|
@@ -51,6 +51,8 @@
|
||||
#if OPAL_ENABLE_FT_CR == 1
|
||||
#include "opal/mca/crs/crs.h"
|
||||
#include "opal/mca/crs/base/base.h"
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/mca/compress/base/base.h"
|
||||
#endif
|
||||
#include "opal/runtime/opal.h"
|
||||
#include "opal/dss/dss.h"
|
||||
@@ -114,6 +116,8 @@
|
||||
#if OPAL_ENABLE_FT_CR == 1
|
||||
#include "orte/mca/snapc/snapc.h"
|
||||
#include "orte/mca/snapc/base/base.h"
|
||||
#include "orte/mca/sstore/sstore.h"
|
||||
#include "orte/mca/sstore/base/base.h"
|
||||
#endif
|
||||
#if ORTE_ENABLE_SENSORS
|
||||
#include "orte/mca/sensor/sensor.h"
|
||||
@@ -330,6 +334,14 @@ void ompi_info_open_components(void)
|
||||
map->type = strdup("crs");
|
||||
map->components = &opal_crs_base_components_available;
|
||||
opal_pointer_array_add(&component_map, map);
|
||||
|
||||
if (OPAL_SUCCESS != opal_compress_base_open()) {
|
||||
goto error;
|
||||
}
|
||||
map = OBJ_NEW(ompi_info_component_map_t);
|
||||
map->type = strdup("compress");
|
||||
map->components = &opal_compress_base_components_available;
|
||||
opal_pointer_array_add(&component_map, map);
|
||||
#endif
|
||||
|
||||
/* OPAL's installdirs base open has already been called as part of
|
||||
@@ -460,6 +472,14 @@ void ompi_info_open_components(void)
|
||||
opal_pointer_array_add(&component_map, map);
|
||||
|
||||
#if OPAL_ENABLE_FT_CR == 1
|
||||
if (ORTE_SUCCESS != orte_sstore_base_open()) {
|
||||
goto error;
|
||||
}
|
||||
map = OBJ_NEW(ompi_info_component_map_t);
|
||||
map->type = strdup("sstore");
|
||||
map->components = &orte_sstore_base_components_available;
|
||||
opal_pointer_array_add(&component_map, map);
|
||||
|
||||
if (ORTE_SUCCESS != orte_snapc_base_open()) {
|
||||
goto error;
|
||||
}
|
||||
@@ -680,6 +700,7 @@ void ompi_info_close_components()
|
||||
#if !ORTE_DISABLE_FULL_SUPPORT
|
||||
#if OPAL_ENABLE_FT_CR == 1
|
||||
(void) orte_snapc_base_close();
|
||||
(void) orte_sstore_base_close();
|
||||
#endif
|
||||
(void) orte_filem_base_close();
|
||||
(void) orte_iof_base_close();
|
||||
|
@@ -37,6 +37,9 @@
|
||||
#include "opal/class/opal_object.h"
|
||||
#include "opal/class/opal_pointer_array.h"
|
||||
#include "opal/runtime/opal.h"
|
||||
#if OPAL_ENABLE_FT_CR == 1
|
||||
#include "opal/runtime/opal_cr.h"
|
||||
#endif
|
||||
#include "opal/util/cmd_line.h"
|
||||
#include "opal/util/argv.h"
|
||||
#include "opal/mca/base/base.h"
|
||||
@@ -196,7 +199,9 @@ int main(int argc, char *argv[])
|
||||
opal_pointer_array_add(&mca_types, "installdirs");
|
||||
opal_pointer_array_add(&mca_types, "sysinfo");
|
||||
#if OPAL_ENABLE_FT_CR == 1
|
||||
opal_cr_set_enabled(true);
|
||||
opal_pointer_array_add(&mca_types, "crs");
|
||||
opal_pointer_array_add(&mca_types, "compress");
|
||||
#endif
|
||||
opal_pointer_array_add(&mca_types, "dpm");
|
||||
opal_pointer_array_add(&mca_types, "pubsub");
|
||||
@@ -228,6 +233,7 @@ int main(int argc, char *argv[])
|
||||
opal_pointer_array_add(&mca_types, "routed");
|
||||
opal_pointer_array_add(&mca_types, "plm");
|
||||
#if OPAL_ENABLE_FT_CR == 1
|
||||
opal_pointer_array_add(&mca_types, "sstore");
|
||||
opal_pointer_array_add(&mca_types, "snapc");
|
||||
#endif
|
||||
#if ORTE_ENABLE_SENSORS
|
||||
|
@@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
* University Research and Technology
|
||||
* Corporation. All rights reserved.
|
||||
* Copyright (c) 2004-2006 The University of Tennessee and The University
|
||||
@@ -515,6 +515,7 @@ void ompi_info_do_config(bool want_all)
|
||||
char *wtime_support;
|
||||
char *symbol_visibility;
|
||||
char *ft_support;
|
||||
char *crdebug_support;
|
||||
/* Do a little preprocessor trickery here to figure out the
|
||||
* tri-state of MPI_PARAM_CHECK (which will be either 0, 1, or
|
||||
* ompi_mpi_param_check). The preprocessor will only allow
|
||||
@@ -583,6 +584,9 @@ void ompi_info_do_config(bool want_all)
|
||||
asprintf(&ft_support, "%s (checkpoint thread: %s)",
|
||||
OPAL_ENABLE_FT ? "yes" : "no", OPAL_ENABLE_FT_THREAD ? "yes" : "no");
|
||||
|
||||
asprintf(&crdebug_support, "%s",
|
||||
OPAL_ENABLE_CRDEBUG ? "yes" : "no");
|
||||
|
||||
/* output values */
|
||||
ompi_info_out("Configured by", "config:user", OMPI_CONFIGURE_USER);
|
||||
ompi_info_out("Configured on", "config:timestamp", OMPI_CONFIGURE_DATE);
|
||||
@@ -834,6 +838,9 @@ void ompi_info_do_config(bool want_all)
|
||||
ompi_info_out("FT Checkpoint support", "options:ft_support", ft_support);
|
||||
free(ft_support);
|
||||
|
||||
ompi_info_out("C/R Enabled Debugging", "options:crdebug_support", crdebug_support);
|
||||
free(crdebug_support);
|
||||
|
||||
ompi_info_out_int("MPI_MAX_PROCESSOR_NAME", "options:mpi-max-processor-name",
|
||||
MPI_MAX_PROCESSOR_NAME);
|
||||
ompi_info_out_int("MPI_MAX_ERROR_STRING", "options:mpi-max-error-string",
|
||||
|
@@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
|
||||
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||
@@ -39,6 +39,7 @@ install-exec-hook:
|
||||
if WANT_FT
|
||||
(cd $(DESTDIR)$(bindir); rm -f ompi-checkpoint$(EXEEXT); $(LN_S) orte-checkpoint$(EXEEXT) ompi-checkpoint$(EXEEXT))
|
||||
(cd $(DESTDIR)$(bindir); rm -f ompi-restart$(EXEEXT); $(LN_S) orte-restart$(EXEEXT) ompi-restart$(EXEEXT))
|
||||
(cd $(DESTDIR)$(bindir); rm -f ompi-migrate$(EXEEXT); $(LN_S) orte-migrate$(EXEEXT) ompi-migrate$(EXEEXT))
|
||||
endif
|
||||
|
||||
uninstall-local:
|
||||
@@ -50,7 +51,8 @@ uninstall-local:
|
||||
$(DESTDIR)$(bindir)/ompi-top$(EXEEXT)
|
||||
if WANT_FT
|
||||
rm -f $(DESTDIR)$(bindir)/ompi-checkpoint$(EXEEXT) \
|
||||
$(DESTDIR)$(bindir)/ompi-restart$(EXEEXT)
|
||||
$(DESTDIR)$(bindir)/ompi-restart$(EXEEXT) \
|
||||
$(DESTDIR)$(bindir)/ompi-migrate$(EXEEXT)
|
||||
endif
|
||||
|
||||
endif # !ORTE_DISABLE_FULL_SUPPORT
|
||||
@@ -95,6 +97,12 @@ $(top_builddir)/orte/tools/orte-restart/orte-restart.1:
|
||||
ompi-restart.1: $(top_builddir)/orte/tools/orte-restart/orte-restart.1
|
||||
cp -f $(top_builddir)/orte/tools/orte-restart/orte-restart.1 ompi-restart.1
|
||||
|
||||
$(top_builddir)/orte/tools/orte-migrate/orte-migrate.1:
|
||||
(cd $(top_builddir)/orte/tools/orte-migrate && $(MAKE) $(AM_MAKEFLAGS) orte-migrate.1)
|
||||
|
||||
ompi-migrate.1: $(top_builddir)/orte/tools/orte-migrate/orte-migrate.1
|
||||
cp -f $(top_builddir)/orte/tools/orte-migrate/orte-migrate.1 ompi-migrate.1
|
||||
|
||||
$(top_builddir)/orte/tools/orte-top/orte-top.1:
|
||||
(cd $(top_builddir)/orte/tools/orte-top && $(MAKE) $(AM_MAKEFLAGS) orte-top.1)
|
||||
|
||||
|
@@ -541,4 +541,27 @@ OPAL_WITH_OPTION_MIN_MAX_VALUE(datarep_string, 128, 64, 256)
|
||||
AC_ARG_WITH([libltdl],
|
||||
[AC_HELP_STRING([--with-libltdl(=DIR)],
|
||||
[Where to find libltdl (this option is ignored if --disable-dlopen is used). DIR can take one of three values: "internal", "external", or a valid directory name. "internal" (or no DIR value) forces Open MPI to use its internal copy of libltdl. "external" forces Open MPI to use an external installation of libltdl. Supplying a valid directory name also forces Open MPI to use an external installation of libltdl, and adds DIR/include, DIR/lib, and DIR/lib64 to the search path for headers and libraries.])])
|
||||
|
||||
#
|
||||
# Checkpoint/restart enabled debugging
|
||||
#
|
||||
AC_MSG_CHECKING([if want checkpoint/restart enabled debugging option])
|
||||
AC_ARG_ENABLE([crdebug],
|
||||
[AC_HELP_STRING([--enable-crdebug],
|
||||
[enable checkpoint/restart debugging functionality (default: disabled)])])
|
||||
|
||||
if test "$ompi_want_ft" = "0"; then
|
||||
ompi_want_prd=0
|
||||
AC_MSG_RESULT([Disabled (fault tolerance disabled --without-ft)])
|
||||
elif test "$enable_crdebug" = "yes"; then
|
||||
ompi_want_prd=1
|
||||
AC_MSG_RESULT([Enabled])
|
||||
else
|
||||
ompi_want_prd=0
|
||||
AC_MSG_RESULT([Disabled])
|
||||
fi
|
||||
|
||||
AC_DEFINE_UNQUOTED([OPAL_ENABLE_CRDEBUG], [$ompi_want_prd],
|
||||
[Whether we want checkpoint/restart enabled debugging functionality or not])
|
||||
|
||||
])dnl
|
||||
|
42
opal/mca/compress/Makefile.am
Normal file
@@ -0,0 +1,42 @@
|
||||
#
|
||||
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
#
|
||||
# $HEADER$
|
||||
#
|
||||
|
||||
include $(top_srcdir)/Makefile.man-page-rules
|
||||
|
||||
# main library setup
|
||||
noinst_LTLIBRARIES = libmca_compress.la
|
||||
libmca_compress_la_SOURCES =
|
||||
|
||||
# header setup
|
||||
nobase_opal_HEADERS =
|
||||
|
||||
# local files
|
||||
headers = compress.h
|
||||
libmca_compress_la_SOURCES += $(headers)
|
||||
|
||||
# Ensure that the man pages are rebuilt if the opal_config.h file
|
||||
# changes; a "good enough" way to know if configure was run again (and
|
||||
# therefore the release date or version may have changed)
|
||||
$(nodist_man_MANS): $(top_builddir)/opal/include/opal_config.h
|
||||
|
||||
# Conditionally install the header files
|
||||
if WANT_INSTALL_HEADERS
|
||||
nobase_opal_HEADERS += $(headers)
|
||||
opaldir = $(includedir)/openmpi/opal/mca/compress
|
||||
else
|
||||
opaldir = $(includedir)
|
||||
endif
|
||||
|
||||
include base/Makefile.am
|
||||
|
||||
distclean-local:
|
||||
rm -f base/static-components.h
|
||||
rm -f $(nodist_man_MANS)
|
21
opal/mca/compress/base/Makefile.am
Normal file
@@ -0,0 +1,21 @@
|
||||
#
|
||||
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
#
|
||||
# $HEADER$
|
||||
#
|
||||
|
||||
dist_pkgdata_DATA = base/help-opal-compress-base.txt
|
||||
|
||||
headers += \
|
||||
base/base.h
|
||||
|
||||
libmca_compress_la_SOURCES += \
|
||||
base/compress_base_open.c \
|
||||
base/compress_base_close.c \
|
||||
base/compress_base_select.c \
|
||||
base/compress_base_fns.c
|
76
opal/mca/compress/base/base.h
Normal file
@@ -0,0 +1,76 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
* University Research and Technology
|
||||
* Corporation. All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
#ifndef OPAL_COMPRESS_BASE_H
|
||||
#define OPAL_COMPRESS_BASE_H
|
||||
|
||||
#include "opal_config.h"
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/util/opal_environ.h"
|
||||
#include "opal/runtime/opal_cr.h"
|
||||
|
||||
/*
|
||||
* Global functions for the overall COMPRESS MCA framework
|
||||
*/
|
||||
|
||||
#if defined(c_plusplus) || defined(__cplusplus)
|
||||
extern "C" {
|
||||
#endif
|
||||
|
||||
/**
|
||||
* Initialize the COMPRESS MCA framework
|
||||
*
|
||||
* @retval OPAL_SUCCESS Upon success
|
||||
* @retval OPAL_ERROR Upon failures
|
||||
*
|
||||
* This function is invoked during opal_init();
|
||||
*/
|
||||
OPAL_DECLSPEC int opal_compress_base_open(void);
|
||||
|
||||
/**
|
||||
* Select an available component.
|
||||
*
|
||||
* @retval OPAL_SUCCESS Upon Success
|
||||
* @retval OPAL_NOT_FOUND If no component can be selected
|
||||
* @retval OPAL_ERROR Upon other failure
|
||||
*
|
||||
*/
|
||||
OPAL_DECLSPEC int opal_compress_base_select(void);
|
||||
|
||||
/**
|
||||
* Finalize the COMPRESS MCA framework
|
||||
*
|
||||
* @retval OPAL_SUCCESS Upon success
|
||||
* @retval OPAL_ERROR Upon failures
|
||||
*
|
||||
* This function is invoked during opal_finalize();
|
||||
*/
|
||||
OPAL_DECLSPEC int opal_compress_base_close(void);
|
||||
|
||||
/**
|
||||
* Globals
|
||||
*/
|
||||
OPAL_DECLSPEC extern int opal_compress_base_output;
|
||||
OPAL_DECLSPEC extern opal_list_t opal_compress_base_components_available;
|
||||
OPAL_DECLSPEC extern opal_compress_base_component_t opal_compress_base_selected_component;
|
||||
OPAL_DECLSPEC extern opal_compress_base_module_t opal_compress;
|
||||
|
||||
/**
|
||||
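* Tar helpers: create "<target>.tar" from *target, or unpack the archive named by *target. On success *target is updated to the new name.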
|
||||
*/
|
||||
OPAL_DECLSPEC int opal_compress_base_tar_create(char ** target);
|
||||
OPAL_DECLSPEC int opal_compress_base_tar_extract(char ** target);
|
||||
|
||||
#if defined(c_plusplus) || defined(__cplusplus)
|
||||
}
|
||||
#endif
|
||||
|
||||
#endif /* OPAL_COMPRESS_BASE_H */
|
40
opal/mca/compress/base/compress_base_close.c
Normal file
@@ -0,0 +1,40 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "opal_config.h"
|
||||
|
||||
#include <string.h>
|
||||
#include "opal/mca/mca.h"
|
||||
#include "opal/mca/base/base.h"
|
||||
#include "opal/include/opal/constants.h"
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/mca/compress/base/base.h"
|
||||
|
||||
int opal_compress_base_close(void)
|
||||
{
|
||||
/* Compression currently only used with C/R */
|
||||
if( !opal_cr_is_enabled ) {
|
||||
opal_output_verbose(10, opal_compress_base_output,
"compress:close: FT is not enabled, skipping!");
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
/* Call the component's finalize routine */
|
||||
if( NULL != opal_compress.finalize ) {
|
||||
opal_compress.finalize();
|
||||
}
|
||||
|
||||
/* Close all available modules that are open */
|
||||
mca_base_components_close(opal_compress_base_output,
|
||||
&opal_compress_base_components_available,
|
||||
NULL);
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
|
142
opal/mca/compress/base/compress_base_fns.c
Normal file
@@ -0,0 +1,142 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "opal_config.h"
|
||||
|
||||
#include <string.h>
|
||||
#include <sys/wait.h>
|
||||
#if HAVE_SYS_TYPES_H
|
||||
#include <sys/types.h>
|
||||
#endif
|
||||
#if HAVE_UNISTD_H
|
||||
#include <unistd.h>
|
||||
#endif
|
||||
#ifdef HAVE_FCNTL_H
|
||||
#include <fcntl.h>
|
||||
#endif /* HAVE_FCNTL_H */
|
||||
#ifdef HAVE_SYS_STAT_H
|
||||
#include <sys/stat.h>
|
||||
#endif
|
||||
|
||||
#include "opal/mca/mca.h"
|
||||
#include "opal/mca/base/base.h"
|
||||
#include "opal/include/opal/constants.h"
|
||||
#include "opal/util/os_dirpath.h"
|
||||
#include "opal/util/output.h"
|
||||
#include "opal/util/argv.h"
|
||||
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/mca/compress/base/base.h"
|
||||
|
||||
/******************
|
||||
* Local Function Defs
|
||||
******************/
|
||||
|
||||
/******************
|
||||
* Object stuff
|
||||
******************/
|
||||
|
||||
int opal_compress_base_tar_create(char ** target)
|
||||
{
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
char *cmd = NULL;
|
||||
char *tar_target = NULL;
|
||||
char **argv = NULL;
|
||||
pid_t child_pid = 0;
|
||||
int status = 0;
|
||||
|
||||
asprintf(&tar_target, "%s.tar", *target);
|
||||
|
||||
child_pid = fork();
|
||||
if( 0 == child_pid ) { /* Child */
|
||||
asprintf(&cmd, "tar -cf %s %s", tar_target, *target);
|
||||
|
||||
argv = opal_argv_split(cmd, ' ');
|
||||
status = execvp(argv[0], argv);
|
||||
|
||||
opal_output(0, "compress:base: Tar:: Failed to exec child [%s] status = %d\n", cmd, status);
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
else if(0 < child_pid) {
|
||||
waitpid(child_pid, &status, 0);
|
||||
|
||||
if( !WIFEXITED(status) || 0 != WEXITSTATUS(status) ) {
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
free(*target);
|
||||
*target = strdup(tar_target);
|
||||
}
|
||||
else {
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
if( NULL != cmd ) {
|
||||
free(cmd);
|
||||
cmd = NULL;
|
||||
}
|
||||
if( NULL != tar_target ) {
|
||||
free(tar_target);
|
||||
tar_target = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
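The helper above builds one command string and splits it on spaces, so a target path that contains a space would be broken into several arguments. The following is a minimal sketch, assuming a POSIX system, of the same fork/exec/wait pattern with the argument vector passed directly and a nonzero exit code treated as failure. `run_and_wait` is a hypothetical helper, not an Open MPI function.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] (looked up in PATH) with the given
 * NULL-terminated argument vector and wait for it to finish.
 * Returns 0 only when the child exited normally with status 0,
 * and -1 for every other outcome. */
static int run_and_wait(char *const argv[])
{
    int status = 0;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;                    /* fork failed */
    }
    if (0 == pid) {
        execvp(argv[0], argv);
        _exit(127);                   /* reached only if exec failed */
    }

    /* Retry when the wait is interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0) {
        if (EINTR != errno) {
            return -1;
        }
    }

    if (!WIFEXITED(status) || 0 != WEXITSTATUS(status)) {
        return -1;                    /* killed by a signal or exit code != 0 */
    }
    return 0;
}
```

A tar invocation would then be written as a vector such as `{ "tar", "-cf", tar_target, target, NULL }`, which keeps each path as a single argument regardless of its contents.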
|
||||
|
||||
int opal_compress_base_tar_extract(char ** target)
|
||||
{
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
char *cmd = NULL;
|
||||
char **argv = NULL;
|
||||
pid_t child_pid = 0;
|
||||
int status = 0;
|
||||
|
||||
child_pid = fork();
|
||||
if( 0 == child_pid ) { /* Child */
|
||||
asprintf(&cmd, "tar -xf %s", *target);
|
||||
|
||||
argv = opal_argv_split(cmd, ' ');
|
||||
status = execvp(argv[0], argv);
|
||||
|
||||
opal_output(0, "compress:base: Tar:: Failed to exec child [%s] status = %d\n", cmd, status);
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
else if(0 < child_pid) {
|
||||
waitpid(child_pid, &status, 0);
|
||||
|
||||
if( !WIFEXITED(status) || 0 != WEXITSTATUS(status) ) {
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/* Strip off the '.tar' */
|
||||
(*target)[strlen(*target)-4] = '\0';
|
||||
}
|
||||
else {
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
if( NULL != cmd ) {
|
||||
free(cmd);
|
||||
cmd = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
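Stripping the suffix by index, as the extract helper does, assumes the name is at least four characters long and really ends in `.tar`. The following is a minimal sketch of a checked variant; `strip_suffix` is a hypothetical name and is not part of the compress framework.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: remove `suffix` from the end of `name` in place.
 * Returns `name` when the suffix was present and has been removed,
 * or NULL (leaving `name` untouched) when it was not. */
static char *strip_suffix(char *name, const char *suffix)
{
    size_t name_len, suffix_len;

    if (NULL == name || NULL == suffix) {
        return NULL;
    }
    name_len   = strlen(name);
    suffix_len = strlen(suffix);

    /* Too short, or the tail does not match: nothing to strip. */
    if (name_len < suffix_len ||
        0 != strcmp(name + name_len - suffix_len, suffix)) {
        return NULL;
    }

    name[name_len - suffix_len] = '\0';
    return name;
}
```

With this shape, a caller can report an error for an unexpected file name instead of writing before the start of a short string.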
|
||||
|
||||
/******************
|
||||
* Local Functions
|
||||
******************/
|
99
opal/mca/compress/base/compress_base_open.c
Normal file
@@ -0,0 +1,99 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "opal_config.h"
|
||||
|
||||
#include <string.h>
|
||||
#include "opal/mca/mca.h"
|
||||
#include "opal/mca/base/base.h"
|
||||
#include "opal/include/opal/constants.h"
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/mca/compress/base/base.h"
|
||||
#include "opal/util/output.h"
|
||||
|
||||
#include "opal/mca/compress/base/static-components.h"
|
||||
|
||||
/*
|
||||
* Globals
|
||||
*/
|
||||
int opal_compress_base_output = -1;
|
||||
opal_compress_base_module_t opal_compress = {
|
||||
NULL, /* init */
|
||||
NULL, /* finalize */
|
||||
NULL, /* compress */
|
||||
NULL, /* compress_nb */
|
||||
NULL, /* decompress */
|
||||
NULL /* decompress_nb */
|
||||
};
|
||||
opal_list_t opal_compress_base_components_available;
|
||||
opal_compress_base_component_t opal_compress_base_selected_component;
|
||||
|
||||
/**
|
||||
* Function for finding and opening either all MCA components,
|
||||
* or the one that was specifically requested via a MCA parameter.
|
||||
*/
|
||||
int opal_compress_base_open(void)
|
||||
{
|
||||
int ret, exit_status = OPAL_SUCCESS;
|
||||
int value;
|
||||
char *str_value = NULL;
|
||||
|
||||
/* Debugging/Verbose output */
|
||||
mca_base_param_reg_int_name("compress",
|
||||
"base_verbose",
|
||||
"Verbosity level of the COMPRESS framework",
|
||||
false, false,
|
||||
0, &value);
|
||||
if(0 != value) {
|
||||
opal_compress_base_output = opal_output_open(NULL);
|
||||
} else {
|
||||
opal_compress_base_output = -1;
|
||||
}
|
||||
opal_output_set_verbosity(opal_compress_base_output, value);
|
||||
|
||||
/*
|
||||
* Which COMPRESS component to open
|
||||
* - NULL or "" = auto-select
|
||||
* - "none" = Empty component
|
||||
* - otherwise = select that specific component
|
||||
*/
|
||||
mca_base_param_reg_string_name("compress", NULL,
|
||||
"Which COMPRESS component to use (empty = auto-select)",
|
||||
false, false,
|
||||
NULL, &str_value);
|
||||
|
||||
/* Compression currently only used with C/R */
|
||||
if( !opal_cr_is_enabled ) {
|
||||
opal_output_verbose(10, opal_compress_base_output,
|
||||
"compress:open: FT is not enabled, skipping!");
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
/* Open up all available components */
|
||||
if (OPAL_SUCCESS != (ret = mca_base_components_open("compress",
|
||||
opal_compress_base_output,
|
||||
mca_compress_base_static_components,
|
||||
&opal_compress_base_components_available,
|
||||
true)) ) {
|
||||
if( OPAL_ERR_NOT_FOUND == ret &&
|
||||
NULL != str_value &&
|
||||
0 == strncmp(str_value, "none", strlen("none")) ) {
|
||||
exit_status = OPAL_SUCCESS;
|
||||
} else {
|
||||
exit_status = OPAL_ERROR;
|
||||
}
|
||||
}
|
||||
|
||||
if( NULL != str_value ) {
|
||||
free(str_value);
|
||||
}
|
||||
return exit_status;
|
||||
}
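The comment in the open function lists three cases for the component parameter: unset or empty, "none", and a specific name. The following is a minimal sketch of that mapping as a standalone function. The enum and the function name are hypothetical and do not exist in the MCA base. The sketch compares against "none" with `strcmp()`, so a longer value that merely begins with "none" is treated as a component name; that is a design choice of the sketch, not documented framework behavior.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of the "which component" parameter. */
enum selection_kind {
    SELECT_AUTO,    /* NULL or "": let the framework pick          */
    SELECT_NONE,    /* "none": run without a component             */
    SELECT_NAMED    /* anything else: use that specific component  */
};

/* Hypothetical helper: map a parameter value to one of the three cases. */
static enum selection_kind classify_selection(const char *value)
{
    if (NULL == value || '\0' == value[0]) {
        return SELECT_AUTO;
    }
    if (0 == strcmp(value, "none")) {    /* exact match, not a prefix match */
        return SELECT_NONE;
    }
    return SELECT_NAMED;
}
```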
|
65
opal/mca/compress/base/compress_base_select.c
Normal file
@@ -0,0 +1,65 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "opal_config.h"
|
||||
|
||||
#ifdef HAVE_UNISTD_H
|
||||
#include <unistd.h>
|
||||
#endif
|
||||
|
||||
#include "opal/include/opal/constants.h"
|
||||
#include "opal/util/output.h"
|
||||
#include "opal/mca/mca.h"
|
||||
#include "opal/mca/base/base.h"
|
||||
#include "opal/mca/base/mca_base_param.h"
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/mca/compress/base/base.h"
|
||||
|
||||
int opal_compress_base_select(void)
|
||||
{
|
||||
int ret, exit_status = OPAL_SUCCESS;
|
||||
opal_compress_base_component_t *best_component = NULL;
|
||||
opal_compress_base_module_t *best_module = NULL;
|
||||
|
||||
/* Compression currently only used with C/R */
|
||||
if( !opal_cr_is_enabled ) {
|
||||
opal_output_verbose(10, opal_compress_base_output,
"compress:select: FT is not enabled, skipping!");
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
/*
|
||||
* Select the best component
|
||||
*/
|
||||
if( OPAL_SUCCESS != mca_base_select("compress", opal_compress_base_output,
|
||||
&opal_compress_base_components_available,
|
||||
(mca_base_module_t **) &best_module,
|
||||
(mca_base_component_t **) &best_component) ) {
|
||||
/* This will only happen if no component was selected */
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/* Save the winner */
|
||||
opal_compress_base_selected_component = *best_component;
|
||||
opal_compress = *best_module;
|
||||
|
||||
/* Initialize the winner */
|
||||
if (NULL != best_module) {
|
||||
if (OPAL_SUCCESS != (ret = opal_compress.init()) ) {
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
}
|
||||
|
||||
cleanup:
|
||||
return exit_status;
|
||||
}
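The select step above delegates the choice of the "best" component to `mca_base_select()`. As a rough illustration of priority-based selection over a table of function-pointer modules, here is a minimal self-contained sketch. Every type and name in it is a hypothetical stand-in rather than the real MCA API, and apart from the bzip default of 10 that appears in this commit, the priorities are made up.

```c
#include <stddef.h>

/* Hypothetical stand-in for a module: a table of function pointers. */
typedef struct {
    int (*init)(void);
    int (*finalize)(void);
} demo_module_t;

/* Hypothetical stand-in for a component and its query result. */
typedef struct {
    const char    *name;
    int            priority;
    demo_module_t *module;    /* NULL when the component cannot run here */
} demo_component_t;

/* Return the usable component with the highest priority, or NULL when
 * no entry offers a module. On a tie the earlier entry wins. */
static const demo_component_t *demo_select(const demo_component_t *list,
                                           size_t count)
{
    const demo_component_t *best = NULL;
    size_t i;

    for (i = 0; i < count; ++i) {
        if (NULL == list[i].module) {
            continue;                     /* component declined to run */
        }
        if (NULL == best || list[i].priority > best->priority) {
            best = &list[i];
        }
    }
    return best;
}

/* Sample data for illustration only. */
static int demo_noop(void) { return 0; }
static demo_module_t demo_gzip_module = { demo_noop, demo_noop };
static demo_module_t demo_bzip_module = { demo_noop, demo_noop };
static const demo_component_t demo_components[] = {
    { "gzip",   15, &demo_gzip_module },  /* made-up priority */
    { "bzip",   10, &demo_bzip_module },
    { "broken", 99, NULL },               /* skipped: no module */
};
```

A caller would check the returned pointer for NULL before copying the module or calling its `init` function.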
|
13
opal/mca/compress/base/help-opal-compress-base.txt
Normal file
@@ -0,0 +1,13 @@
|
||||
-*- text -*-
|
||||
#
|
||||
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
#
|
||||
# $HEADER$
|
||||
#
|
||||
# This is the US/English general help file for Open PAL Compress framework.
|
||||
#
|
40
opal/mca/compress/bzip/Makefile.am
Normal file
@@ -0,0 +1,40 @@
|
||||
#
|
||||
# Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
# All rights reserved.
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
#
|
||||
# $HEADER$
|
||||
#
|
||||
|
||||
AM_CPPFLAGS = \
|
||||
$(LTDLINCL)
|
||||
|
||||
dist_pkgdata_DATA = help-opal-compress-bzip.txt
|
||||
|
||||
sources = \
|
||||
compress_bzip.h \
|
||||
compress_bzip_component.c \
|
||||
compress_bzip_module.c
|
||||
|
||||
# Make the output library in this directory, and name it either
|
||||
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
|
||||
# (for static builds).
|
||||
|
||||
if OMPI_BUILD_compress_bzip_DSO
|
||||
component_noinst =
|
||||
component_install = mca_compress_bzip.la
|
||||
else
|
||||
component_noinst = libmca_compress_bzip.la
|
||||
component_install =
|
||||
endif
|
||||
|
||||
mcacomponentdir = $(pkglibdir)
|
||||
mcacomponent_LTLIBRARIES = $(component_install)
|
||||
mca_compress_bzip_la_SOURCES = $(sources)
|
||||
mca_compress_bzip_la_LDFLAGS = -module -avoid-version
|
||||
|
||||
noinst_LTLIBRARIES = $(component_noinst)
|
||||
libmca_compress_bzip_la_SOURCES = $(sources)
|
||||
libmca_compress_bzip_la_LDFLAGS = -module -avoid-version
|
63
opal/mca/compress/bzip/compress_bzip.h
Normal file
@@ -0,0 +1,63 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
/**
|
||||
* @file
|
||||
*
|
||||
* BZIP COMPRESS component
|
||||
*
|
||||
* Uses the bzip library
|
||||
*/
|
||||
|
||||
#ifndef MCA_COMPRESS_BZIP_EXPORT_H
|
||||
#define MCA_COMPRESS_BZIP_EXPORT_H
|
||||
|
||||
#include "opal_config.h"
|
||||
|
||||
#include "opal/util/output.h"
|
||||
|
||||
#include "opal/mca/mca.h"
|
||||
#include "opal/mca/compress/compress.h"
|
||||
|
||||
#if defined(c_plusplus) || defined(__cplusplus)
|
||||
extern "C" {
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Local Component structures
|
||||
*/
|
||||
struct opal_compress_bzip_component_t {
|
||||
opal_compress_base_component_t super; /** Base COMPRESS component */
|
||||
|
||||
};
|
||||
typedef struct opal_compress_bzip_component_t opal_compress_bzip_component_t;
|
||||
OPAL_MODULE_DECLSPEC extern opal_compress_bzip_component_t mca_compress_bzip_component;
|
||||
|
||||
int opal_compress_bzip_component_query(mca_base_module_t **module, int *priority);
|
||||
|
||||
/*
|
||||
* Module functions
|
||||
*/
|
||||
int opal_compress_bzip_module_init(void);
|
||||
int opal_compress_bzip_module_finalize(void);
|
||||
|
||||
/*
|
||||
* Actual functionality
|
||||
*/
|
||||
int opal_compress_bzip_compress(char *fname, char **cname, char **postfix);
|
||||
int opal_compress_bzip_compress_nb(char *fname, char **cname, char **postfix, pid_t *child_pid);
|
||||
int opal_compress_bzip_decompress(char *cname, char **fname);
|
||||
int opal_compress_bzip_decompress_nb(char *cname, char **fname, pid_t *child_pid);
|
||||
|
||||
#if defined(c_plusplus) || defined(__cplusplus)
|
||||
}
|
||||
#endif
|
||||
|
||||
#endif /* MCA_COMPRESS_BZIP_EXPORT_H */
|
138
opal/mca/compress/bzip/compress_bzip_component.c
Normal file
@@ -0,0 +1,138 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "opal_config.h"
|
||||
|
||||
#include "opal/constants.h"
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/mca/compress/base/base.h"
|
||||
#include "compress_bzip.h"
|
||||
|
||||
/*
|
||||
* Public string for version number
|
||||
*/
|
||||
const char *opal_compress_bzip_component_version_string =
|
||||
"OPAL COMPRESS bzip MCA component version " OPAL_VERSION;
|
||||
|
||||
/*
|
||||
* Local functionality
|
||||
*/
|
||||
static int compress_bzip_open(void);
|
||||
static int compress_bzip_close(void);
|
||||
|
||||
/*
|
||||
* Instantiate the public struct with all of our public information
|
||||
* and pointer to our public functions in it
|
||||
*/
|
||||
opal_compress_bzip_component_t mca_compress_bzip_component = {
|
||||
/* First do the base component stuff */
|
||||
{
|
||||
/* Handle the general mca_component_t struct containing
|
||||
* meta information about the component itself
|
||||
*/
|
||||
{
|
||||
OPAL_COMPRESS_BASE_VERSION_2_0_0,
|
||||
|
||||
/* Component name and version */
|
||||
"bzip",
|
||||
OPAL_MAJOR_VERSION,
|
||||
OPAL_MINOR_VERSION,
|
||||
OPAL_RELEASE_VERSION,
|
||||
|
||||
/* Component open and close functions */
|
||||
compress_bzip_open,
|
||||
compress_bzip_close,
|
||||
opal_compress_bzip_component_query
|
||||
},
|
||||
{
|
||||
/* The component is checkpoint ready */
|
||||
MCA_BASE_METADATA_PARAM_CHECKPOINT
|
||||
},
|
||||
|
||||
/* Verbosity level */
|
||||
0,
|
||||
/* opal_output handler */
|
||||
-1,
|
||||
/* Default priority */
|
||||
10
|
||||
}
|
||||
};
|
||||
|
||||
/*
|
||||
* Bzip module
|
||||
*/
|
||||
static opal_compress_base_module_t loc_module = {
|
||||
/** Initialization Function */
|
||||
opal_compress_bzip_module_init,
|
||||
/** Finalization Function */
|
||||
opal_compress_bzip_module_finalize,
|
||||
|
||||
/** Compress Function */
|
||||
opal_compress_bzip_compress,
|
||||
opal_compress_bzip_compress_nb,
|
||||
|
||||
/** Decompress Function */
|
||||
opal_compress_bzip_decompress,
|
||||
opal_compress_bzip_decompress_nb
|
||||
};
|
||||
|
||||
static int compress_bzip_open(void)
|
||||
{
|
||||
mca_base_param_reg_int(&mca_compress_bzip_component.super.base_version,
|
||||
"priority",
|
||||
"Priority of the COMPRESS bzip component",
|
||||
false, false,
|
||||
mca_compress_bzip_component.super.priority,
|
||||
&mca_compress_bzip_component.super.priority);
|
||||
|
||||
mca_base_param_reg_int(&mca_compress_bzip_component.super.base_version,
|
||||
"verbose",
|
||||
"Verbose level for the COMPRESS bzip component",
|
||||
false, false,
|
||||
mca_compress_bzip_component.super.verbose,
|
||||
&mca_compress_bzip_component.super.verbose);
|
||||
/* If there is a custom verbose level for this component, then use it;
|
||||
* otherwise take our parents level and output channel
|
||||
*/
|
||||
if ( 0 != mca_compress_bzip_component.super.verbose) {
|
||||
mca_compress_bzip_component.super.output_handle = opal_output_open(NULL);
|
||||
opal_output_set_verbosity(mca_compress_bzip_component.super.output_handle,
|
||||
mca_compress_bzip_component.super.verbose);
|
||||
} else {
|
||||
mca_compress_bzip_component.super.output_handle = opal_compress_base_output;
|
||||
}
|
||||
|
||||
/*
|
||||
* Debug output
|
||||
*/
|
||||
opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
|
||||
"compress:bzip: open()");
|
||||
opal_output_verbose(20, mca_compress_bzip_component.super.output_handle,
|
||||
"compress:bzip: open: priority = %d",
|
||||
mca_compress_bzip_component.super.priority);
|
||||
opal_output_verbose(20, mca_compress_bzip_component.super.output_handle,
|
||||
"compress:bzip: open: verbosity = %d",
|
||||
mca_compress_bzip_component.super.verbose);
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
static int compress_bzip_close(void)
|
||||
{
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_compress_bzip_component_query(mca_base_module_t **module, int *priority)
|
||||
{
|
||||
*module = (mca_base_module_t *)&loc_module;
|
||||
*priority = mca_compress_bzip_component.super.priority;
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
247
opal/mca/compress/bzip/compress_bzip_module.c
Normal file
@@ -0,0 +1,247 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "opal_config.h"
|
||||
|
||||
#include <string.h>
|
||||
#include <sys/types.h>
|
||||
#include <sys/wait.h>
|
||||
#include <sys/stat.h>
|
||||
#if HAVE_UNISTD_H
|
||||
#include <unistd.h>
|
||||
#endif /* HAVE_UNISTD_H */
|
||||
|
||||
#include "opal/util/opal_environ.h"
|
||||
#include "opal/util/output.h"
|
||||
#include "opal/util/show_help.h"
|
||||
#include "opal/util/argv.h"
|
||||
|
||||
#include "opal/constants.h"
|
||||
#include "opal/mca/base/mca_base_param.h"
|
||||
#include "opal/util/basename.h"
|
||||
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/mca/compress/base/base.h"
|
||||
#include "opal/runtime/opal_cr.h"
|
||||
|
||||
#include "compress_bzip.h"
|
||||
|
||||
static bool is_directory(char *fname );
|
||||
|
||||
int opal_compress_bzip_module_init(void)
|
||||
{
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_compress_bzip_module_finalize(void)
|
||||
{
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_compress_bzip_compress(char * fname, char **cname, char **postfix)
|
||||
{
|
||||
pid_t child_pid = 0;
|
||||
int status = 0;
|
||||
|
||||
opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
|
||||
"compress:bzip: compress(%s)",
|
||||
fname);
|
||||
|
||||
opal_compress_bzip_compress_nb(fname, cname, postfix, &child_pid);
|
||||
waitpid(child_pid, &status, 0);
|
||||
|
||||
if( WIFEXITED(status) && 0 == WEXITSTATUS(status) ) {
|
||||
return OPAL_SUCCESS;
|
||||
} else {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
}
|
||||
|
||||
int opal_compress_bzip_compress_nb(char * fname, char **cname, char **postfix, pid_t *child_pid)
|
||||
{
|
||||
char * cmd = NULL;
|
||||
char **argv = NULL;
|
||||
char * base_fname = NULL;
|
||||
char * dir_fname = NULL;
|
||||
int status;
|
||||
bool is_dir;
|
||||
|
||||
is_dir = is_directory(fname);
|
||||
|
||||
*child_pid = fork();
|
||||
if( *child_pid == 0 ) { /* Child */
|
||||
|
||||
dir_fname = opal_dirname(fname);
|
||||
base_fname = opal_basename(fname);
|
||||
|
||||
chdir(dir_fname);
|
||||
|
||||
if( is_dir ) {
|
||||
#if 0
|
||||
opal_compress_base_tar_create(&base_fname);
|
||||
asprintf(cname, "%s.bz2", base_fname);
|
||||
asprintf(&cmd, "bzip2 %s", base_fname);
|
||||
#else
|
||||
asprintf(cname, "%s.tar.bz2", base_fname);
|
||||
asprintf(&cmd, "tar -jcf %s %s", *cname, base_fname);
|
||||
#endif
|
||||
} else {
|
||||
asprintf(cname, "%s.bz2", base_fname);
|
||||
asprintf(&cmd, "bzip2 %s", base_fname);
|
||||
}
|
||||
|
||||
opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
|
||||
"compress:bzip: compress_nb(%s -> [%s])",
|
||||
fname, *cname);
|
||||
opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
|
||||
"compress:bzip: compress_nb() command [%s]",
|
||||
cmd);
|
||||
|
||||
argv = opal_argv_split(cmd, ' ');
|
||||
status = execvp(argv[0], argv);
|
||||
|
||||
opal_output(0, "compress:bzip: compress_nb: Failed to exec child [%s] status = %d\n", cmd, status);
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
else if( *child_pid > 0 ) {
|
||||
if( is_dir ) {
|
||||
*postfix = strdup(".tar.bz2");
|
||||
} else {
|
||||
*postfix = strdup(".bz2");
|
||||
}
|
||||
asprintf(cname, "%s%s", fname, *postfix);
|
||||
}
|
||||
else {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
|
||||
if( NULL != cmd ) {
|
||||
free(cmd);
|
||||
cmd = NULL;
|
||||
}
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_compress_bzip_decompress(char * cname, char **fname)
|
||||
{
|
||||
pid_t child_pid = 0;
|
||||
int status = 0;
|
||||
|
||||
opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
|
||||
"compress:bzip: decompress(%s)",
|
||||
cname);
|
||||
|
||||
opal_compress_bzip_decompress_nb(cname, fname, &child_pid);
|
||||
waitpid(child_pid, &status, 0);
|
||||
|
||||
if( WIFEXITED(status) && 0 == WEXITSTATUS(status) ) {
|
||||
return OPAL_SUCCESS;
|
||||
} else {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
}
|
||||
|
||||
int opal_compress_bzip_decompress_nb(char * cname, char **fname, pid_t *child_pid)
|
||||
{
|
||||
char * cmd = NULL;
|
||||
char **argv = NULL;
|
||||
char * dir_cname = NULL;
|
||||
pid_t loc_pid = 0;
|
||||
int status;
|
||||
bool is_tar = false;
|
||||
|
||||
if( 0 == strncmp(&(cname[strlen(cname)-8]), ".tar.bz2", strlen(".tar.bz2")) ) {
|
||||
is_tar = true;
|
||||
}
|
||||
|
||||
*fname = strdup(cname);
|
||||
if( is_tar ) {
|
||||
(*fname)[strlen(cname)-8] = '\0';
|
||||
} else {
|
||||
(*fname)[strlen(cname)-4] = '\0';
|
||||
}
|
||||
|
||||
opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
|
||||
"compress:bzip: decompress_nb(%s -> [%s])",
|
||||
cname, *fname);
|
||||
|
||||
*child_pid = fork();
|
||||
if( *child_pid == 0 ) { /* Child */
|
||||
dir_cname = opal_dirname(cname);
|
||||
|
||||
chdir(dir_cname);
|
||||
|
||||
/* Fork(bunzip) */
|
||||
loc_pid = fork();
|
||||
if( loc_pid == 0 ) { /* Child */
|
||||
asprintf(&cmd, "bunzip2 %s", cname);
|
||||
|
||||
opal_output_verbose(10, mca_compress_bzip_component.super.output_handle,
|
||||
"compress:bzip: decompress_nb() command [%s]",
|
||||
cmd);
|
||||
|
||||
argv = opal_argv_split(cmd, ' ');
|
||||
status = execvp(argv[0], argv);
|
||||
|
||||
opal_output(0, "compress:bzip: decompress_nb: Failed to exec child [%s] status = %d\n", cmd, status);
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
else if( loc_pid > 0 ) { /* Parent */
|
||||
waitpid(loc_pid, &status, 0);
|
||||
if( !WIFEXITED(status) ) {
|
||||
opal_output(0, "compress:bzip: decompress_nb: Failed to bunzip the file [%s] status = %d\n", cname, status);
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
}
|
||||
else {
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
|
||||
/* tar_decompress */
|
||||
if( is_tar ) {
|
||||
/* Strip off '.bz2' leaving just '.tar' */
|
||||
cname[strlen(cname)-4] = '\0';
|
||||
opal_compress_base_tar_extract(&cname);
|
||||
}
|
||||
|
||||
/* Once this child is done, then directly exit */
|
||||
exit(OPAL_SUCCESS);
|
||||
}
|
||||
else if( *child_pid > 0 ) {
|
||||
;
|
||||
}
|
||||
else {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
|
||||
if( NULL != cmd ) {
|
||||
free(cmd);
|
||||
cmd = NULL;
|
||||
}
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
static bool is_directory(char *fname ) {
|
||||
struct stat file_status;
|
||||
int rc;
|
||||
|
||||
if(0 != (rc = stat(fname, &file_status) ) ) {
|
||||
return false;
|
||||
}
|
||||
if(S_ISDIR(file_status.st_mode)) {
|
||||
return true;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
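Review note: the blocking wrappers in this module decide success from `WIFEXITED(status)`, and the commands are built as one string and split on spaces. The sketch below shows one way to run a helper command and report success only on a zero exit code, passing arguments as a vector so file names containing spaces survive. `run_and_wait` is a hypothetical helper invented for illustration; it is not part of Open MPI or of this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function): run argv[0] with the
 * given NULL-terminated argument vector and wait for it to finish.
 * Returns true only if the child exited normally with exit code 0. */
static bool run_and_wait(char *const argv[])
{
    assert(NULL != argv && NULL != argv[0]);

    pid_t pid = fork();
    if (pid < 0) {
        return false;              /* fork failed, no child to wait for */
    }
    if (0 == pid) {
        execvp(argv[0], argv);     /* replaces the child on success */
        _exit(127);                /* reached only if exec failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return false;
    }
    /* A normal exit with a nonzero code is still a failure. */
    return WIFEXITED(status) && 0 == WEXITSTATUS(status);
}
```

A caller could build the vector as `{ "bzip2", base_fname, NULL }` instead of formatting `"bzip2 %s"` and splitting it again.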
|
13
opal/mca/compress/bzip/configure.params
Normal file
@@ -0,0 +1,13 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2010 The Trustees of Indiana University.
#                         All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

PARAM_INIT_FILE=compress_bzip_component.c
PARAM_CONFIG_FILES="Makefile"
13
opal/mca/compress/bzip/help-opal-compress-bzip.txt
Normal file
@@ -0,0 +1,13 @@
-*- text -*-
#
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation.  All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English general help file for Open PAL Compress framework.
#
135
opal/mca/compress/compress.h
Normal file
@@ -0,0 +1,135 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
/**
 * @file
 *
 * Compression Framework
 *
 * General Description:
 *
 * The OPAL Compress framework has been created to provide an abstract interface
 * to the compression agent library on the host machine. This framework is useful
 * when distributing files that can be compressed before sending to diminish the
 * load on the network.
 *
 */

#ifndef MCA_COMPRESS_H
#define MCA_COMPRESS_H

#include "opal_config.h"
#include "opal/mca/mca.h"
#include "opal/mca/base/base.h"
#include "opal/class/opal_object.h"

#if defined(c_plusplus) || defined(__cplusplus)
extern "C" {
#endif

/**
 * Module initialization function.
 * Returns OPAL_SUCCESS
 */
typedef int (*opal_compress_base_module_init_fn_t)
     (void);

/**
 * Module finalization function.
 * Returns OPAL_SUCCESS
 */
typedef int (*opal_compress_base_module_finalize_fn_t)
     (void);

/**
 * Compress the file provided
 *
 * Arguments:
 *   fname   = Filename to compress
 *   cname   = Compressed filename
 *   postfix = postfix added to filename to create compressed filename
 * Returns:
 *   OPAL_SUCCESS on success, otherwise OPAL_ERROR
 */
typedef int (*opal_compress_base_module_compress_fn_t)
    (char * fname, char **cname, char **postfix);

typedef int (*opal_compress_base_module_compress_nb_fn_t)
    (char * fname, char **cname, char **postfix, pid_t *child_pid);

/**
 * Decompress the file provided
 *
 * Arguments:
 *   cname   = Compressed filename
 *   fname   = Decompressed filename (output)
 * Returns:
 *   OPAL_SUCCESS on success, otherwise OPAL_ERROR
 */
typedef int (*opal_compress_base_module_decompress_fn_t)
    (char * cname, char **fname);
typedef int (*opal_compress_base_module_decompress_nb_fn_t)
    (char * cname, char **fname, pid_t *child_pid);

/**
 * Structure for COMPRESS components.
 */
struct opal_compress_base_component_2_0_0_t {
    /** MCA base component */
    mca_base_component_t base_version;
    /** MCA base data */
    mca_base_component_data_t base_data;

    /** Verbosity Level */
    int verbose;
    /** Output Handle for opal_output */
    int output_handle;
    /** Default Priority */
    int priority;
};
typedef struct opal_compress_base_component_2_0_0_t opal_compress_base_component_2_0_0_t;
typedef struct opal_compress_base_component_2_0_0_t opal_compress_base_component_t;

/**
 * Structure for COMPRESS modules
 */
struct opal_compress_base_module_1_0_0_t {
    /** Initialization Function */
    opal_compress_base_module_init_fn_t           init;
    /** Finalization Function */
    opal_compress_base_module_finalize_fn_t       finalize;

    /** Compress interface */
    opal_compress_base_module_compress_fn_t       compress;
    opal_compress_base_module_compress_nb_fn_t    compress_nb;

    /** Decompress Interface */
    opal_compress_base_module_decompress_fn_t     decompress;
    opal_compress_base_module_decompress_nb_fn_t  decompress_nb;
};
typedef struct opal_compress_base_module_1_0_0_t opal_compress_base_module_1_0_0_t;
typedef struct opal_compress_base_module_1_0_0_t opal_compress_base_module_t;

OPAL_DECLSPEC extern opal_compress_base_module_t opal_compress;

/**
 * Macro for use in components that are of type COMPRESS
 */
#define OPAL_COMPRESS_BASE_VERSION_2_0_0 \
    MCA_BASE_VERSION_2_0_0, \
    "compress", 2, 0, 0

#if defined(c_plusplus) || defined(__cplusplus)
}
#endif

#endif /* MCA_COMPRESS_H */
|
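Review note: the header above describes a module as a table of function pointers (`init`, `finalize`, `compress`, `decompress`, plus the `_nb` variants). A minimal sketch of how a caller might drive such a table follows. Everything prefixed `demo_` is a hypothetical stand-in written for this note: the struct mirrors only the blocking entries, and the "compression" just appends a `.demo` postfix so the example needs no external tools.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_POSTFIX ".demo"

/* Simplified stand-in for a compress module: blocking entries only. */
typedef struct {
    int (*init)(void);
    int (*finalize)(void);
    int (*compress)(char *fname, char **cname, char **postfix);
    int (*decompress)(char *cname, char **fname);
} demo_compress_module_t;

/* Return a newly allocated concatenation of a and b, or NULL. */
static char *demo_concat(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    char *out = malloc(la + lb + 1);
    if (NULL != out) {
        memcpy(out, a, la);
        memcpy(out + la, b, lb + 1);   /* copies the terminating NUL too */
    }
    return out;
}

static int demo_init(void)     { return 0; }
static int demo_finalize(void) { return 0; }

/* "Compress": report the name the compressed file would have. */
static int demo_compress(char *fname, char **cname, char **postfix)
{
    assert(NULL != fname && NULL != cname && NULL != postfix);
    *postfix = demo_concat(DEMO_POSTFIX, "");
    *cname   = demo_concat(fname, DEMO_POSTFIX);
    return (NULL != *postfix && NULL != *cname) ? 0 : -1;
}

/* "Decompress": strip the postfix, rejecting names that lack it. */
static int demo_decompress(char *cname, char **fname)
{
    size_t len = strlen(cname), plen = strlen(DEMO_POSTFIX);
    if (len <= plen || 0 != strcmp(cname + len - plen, DEMO_POSTFIX)) {
        return -1;
    }
    *fname = demo_concat(cname, "");
    if (NULL == *fname) {
        return -1;
    }
    (*fname)[len - plen] = '\0';
    return 0;
}

static const demo_compress_module_t demo_module = {
    demo_init, demo_finalize, demo_compress, demo_decompress
};

/* Drive the table the way a framework user would; 1 on a clean round trip. */
static int demo_round_trip(const demo_compress_module_t *m, char *fname)
{
    char *cname = NULL, *postfix = NULL, *back = NULL;
    int ok = 0;
    if (0 == m->init() &&
        0 == m->compress(fname, &cname, &postfix) &&
        0 == m->decompress(cname, &back)) {
        ok = (0 == strcmp(fname, back));
    }
    free(cname);
    free(postfix);
    free(back);
    m->finalize();
    return ok;
}
```

The caller owns and frees every string returned through an output parameter here; whether the real framework follows the same ownership rule is an assumption, not something this header states.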
40
opal/mca/compress/gzip/Makefile.am
Normal file
@@ -0,0 +1,40 @@
#
# Copyright (c) 2004-2010 The Trustees of Indiana University.
#                         All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

AM_CPPFLAGS = \
    $(LTDLINCL)

dist_pkgdata_DATA = help-opal-compress-gzip.txt

sources = \
    compress_gzip.h \
    compress_gzip_component.c \
    compress_gzip_module.c

# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).

if OMPI_BUILD_compress_gzip_DSO
component_noinst =
component_install = mca_compress_gzip.la
else
component_noinst = libmca_compress_gzip.la
component_install =
endif

mcacomponentdir = $(pkglibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_compress_gzip_la_SOURCES = $(sources)
mca_compress_gzip_la_LDFLAGS = -module -avoid-version

noinst_LTLIBRARIES = $(component_noinst)
libmca_compress_gzip_la_SOURCES = $(sources)
libmca_compress_gzip_la_LDFLAGS = -module -avoid-version
63
opal/mca/compress/gzip/compress_gzip.h
Normal file
@@ -0,0 +1,63 @@
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University.
 *                         All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/**
 * @file
 *
 * GZIP COMPRESS component
 *
 * Uses the gzip command-line utilities (gzip, gunzip, tar -z)
 */

#ifndef MCA_COMPRESS_GZIP_EXPORT_H
#define MCA_COMPRESS_GZIP_EXPORT_H

#include "opal_config.h"

#include "opal/util/output.h"

#include "opal/mca/mca.h"
#include "opal/mca/compress/compress.h"

#if defined(c_plusplus) || defined(__cplusplus)
extern "C" {
#endif

/*
 * Local Component structures
 */
struct opal_compress_gzip_component_t {
    opal_compress_base_component_t super;  /** Base COMPRESS component */

};
typedef struct opal_compress_gzip_component_t opal_compress_gzip_component_t;
OPAL_MODULE_DECLSPEC extern opal_compress_gzip_component_t mca_compress_gzip_component;

int opal_compress_gzip_component_query(mca_base_module_t **module, int *priority);

/*
 * Module functions
 */
int opal_compress_gzip_module_init(void);
int opal_compress_gzip_module_finalize(void);

/*
 * Actual functionality
 */
int opal_compress_gzip_compress(char *fname, char **cname, char **postfix);
int opal_compress_gzip_compress_nb(char *fname, char **cname, char **postfix, pid_t *child_pid);
int opal_compress_gzip_decompress(char *cname, char **fname);
int opal_compress_gzip_decompress_nb(char *cname, char **fname, pid_t *child_pid);

#if defined(c_plusplus) || defined(__cplusplus)
}
#endif

#endif /* MCA_COMPRESS_GZIP_EXPORT_H */
138
opal/mca/compress/gzip/compress_gzip_component.c
Normal file
@@ -0,0 +1,138 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "opal_config.h"
|
||||
|
||||
#include "opal/constants.h"
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/mca/compress/base/base.h"
|
||||
#include "compress_gzip.h"
|
||||
|
||||
/*
|
||||
* Public string for version number
|
||||
*/
|
||||
const char *opal_compress_gzip_component_version_string =
|
||||
"OPAL COMPRESS gzip MCA component version " OPAL_VERSION;
|
||||
|
||||
/*
|
||||
* Local functionality
|
||||
*/
|
||||
static int compress_gzip_open(void);
|
||||
static int compress_gzip_close(void);
|
||||
|
||||
/*
|
||||
* Instantiate the public struct with all of our public information
|
||||
* and pointer to our public functions in it
|
||||
*/
|
||||
opal_compress_gzip_component_t mca_compress_gzip_component = {
|
||||
/* First do the base component stuff */
|
||||
{
|
||||
/* Handle the general mca_component_t struct containing
|
||||
* meta information about the component itself
|
||||
*/
|
||||
{
|
||||
OPAL_COMPRESS_BASE_VERSION_2_0_0,
|
||||
|
||||
/* Component name and version */
|
||||
"gzip",
|
||||
OPAL_MAJOR_VERSION,
|
||||
OPAL_MINOR_VERSION,
|
||||
OPAL_RELEASE_VERSION,
|
||||
|
||||
/* Component open and close functions */
|
||||
compress_gzip_open,
|
||||
compress_gzip_close,
|
||||
opal_compress_gzip_component_query
|
||||
},
|
||||
{
|
||||
/* The component is checkpoint ready */
|
||||
MCA_BASE_METADATA_PARAM_CHECKPOINT
|
||||
},
|
||||
|
||||
/* Verbosity level */
|
||||
0,
|
||||
/* opal_output handler */
|
||||
-1,
|
||||
/* Default priority */
|
||||
15
|
||||
}
|
||||
};
|
||||
|
||||
/*
|
||||
* Gzip module
|
||||
*/
|
||||
static opal_compress_base_module_t loc_module = {
|
||||
/** Initialization Function */
|
||||
opal_compress_gzip_module_init,
|
||||
/** Finalization Function */
|
||||
opal_compress_gzip_module_finalize,
|
||||
|
||||
/** Compress Function */
|
||||
opal_compress_gzip_compress,
|
||||
opal_compress_gzip_compress_nb,
|
||||
|
||||
/** Decompress Function */
|
||||
opal_compress_gzip_decompress,
|
||||
opal_compress_gzip_decompress_nb
|
||||
};
|
||||
|
||||
static int compress_gzip_open(void)
|
||||
{
|
||||
mca_base_param_reg_int(&mca_compress_gzip_component.super.base_version,
|
||||
"priority",
|
||||
"Priority of the COMPRESS gzip component",
|
||||
false, false,
|
||||
mca_compress_gzip_component.super.priority,
|
||||
&mca_compress_gzip_component.super.priority);
|
||||
|
||||
mca_base_param_reg_int(&mca_compress_gzip_component.super.base_version,
|
||||
"verbose",
|
||||
"Verbose level for the COMPRESS gzip component",
|
||||
false, false,
|
||||
mca_compress_gzip_component.super.verbose,
|
||||
&mca_compress_gzip_component.super.verbose);
|
||||
/* If there is a custom verbose level for this component, then use it;
|
||||
* otherwise take our parents level and output channel
|
||||
*/
|
||||
if ( 0 != mca_compress_gzip_component.super.verbose) {
|
||||
mca_compress_gzip_component.super.output_handle = opal_output_open(NULL);
|
||||
opal_output_set_verbosity(mca_compress_gzip_component.super.output_handle,
|
||||
mca_compress_gzip_component.super.verbose);
|
||||
} else {
|
||||
mca_compress_gzip_component.super.output_handle = opal_compress_base_output;
|
||||
}
|
||||
|
||||
/*
|
||||
* Debug output
|
||||
*/
|
||||
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||
"compress:gzip: open()");
|
||||
opal_output_verbose(20, mca_compress_gzip_component.super.output_handle,
|
||||
"compress:gzip: open: priority = %d",
|
||||
mca_compress_gzip_component.super.priority);
|
||||
opal_output_verbose(20, mca_compress_gzip_component.super.output_handle,
|
||||
"compress:gzip: open: verbosity = %d",
|
||||
mca_compress_gzip_component.super.verbose);
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
static int compress_gzip_close(void)
|
||||
{
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_compress_gzip_component_query(mca_base_module_t **module, int *priority)
|
||||
{
|
||||
*module = (mca_base_module_t *)&loc_module;
|
||||
*priority = mca_compress_gzip_component.super.priority;
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
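Review note: the gzip component registers a default priority of 15 while the bzip component uses 10. Assuming the framework's selection logic (not shown in this part of the diff) prefers the highest priority, as is conventional for MCA frameworks, gzip would be chosen by default. The sketch below illustrates that rule only; `demo_component_t` and `demo_select` are hypothetical names, not Open MPI APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a component: a name and a priority. */
typedef struct {
    const char *name;
    int priority;
} demo_component_t;

/* Return the component with the highest priority, or NULL if n == 0.
 * Ties keep the earliest entry. */
static const demo_component_t *demo_select(const demo_component_t *c, size_t n)
{
    const demo_component_t *best = NULL;
    assert(NULL != c || 0 == n);
    for (size_t i = 0; i < n; ++i) {
        if (NULL == best || c[i].priority > best->priority) {
            best = &c[i];
        }
    }
    return best;
}

/* Default priorities as they appear in the two component files. */
static const demo_component_t demo_components[] = {
    { "bzip", 10 },
    { "gzip", 15 },
};
```

Because each priority is registered as an MCA parameter in the component's open function, a user could presumably change the outcome at run time by raising the bzip priority above 15.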
|
||||
|
250
opal/mca/compress/gzip/compress_gzip_module.c
Normal file
@@ -0,0 +1,250 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "opal_config.h"
|
||||
|
||||
#include <string.h>
|
||||
#include <sys/types.h>
|
||||
#include <sys/wait.h>
|
||||
#include <sys/stat.h>
|
||||
#if HAVE_UNISTD_H
|
||||
#include <unistd.h>
|
||||
#endif /* HAVE_UNISTD_H */
|
||||
|
||||
#include "opal/util/opal_environ.h"
|
||||
#include "opal/util/output.h"
|
||||
#include "opal/util/show_help.h"
|
||||
#include "opal/util/argv.h"
|
||||
|
||||
#include "opal/constants.h"
|
||||
#include "opal/mca/base/mca_base_param.h"
|
||||
#include "opal/util/basename.h"
|
||||
|
||||
#include "opal/mca/compress/compress.h"
|
||||
#include "opal/mca/compress/base/base.h"
|
||||
#include "opal/runtime/opal_cr.h"
|
||||
|
||||
#include "compress_gzip.h"
|
||||
|
||||
static bool is_directory(char *fname );
|
||||
|
||||
int opal_compress_gzip_module_init(void)
|
||||
{
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_compress_gzip_module_finalize(void)
|
||||
{
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_compress_gzip_compress(char * fname, char **cname, char **postfix)
|
||||
{
|
||||
pid_t child_pid = 0;
|
||||
int status = 0;
|
||||
|
||||
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||
"compress:gzip: compress(%s)",
|
||||
fname);
|
||||
|
||||
opal_compress_gzip_compress_nb(fname, cname, postfix, &child_pid);
|
||||
waitpid(child_pid, &status, 0);
|
||||
|
||||
if( WIFEXITED(status) && 0 == WEXITSTATUS(status) ) {
|
||||
return OPAL_SUCCESS;
|
||||
} else {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
}
|
||||
|
||||
int opal_compress_gzip_compress_nb(char * fname, char **cname, char **postfix, pid_t *child_pid)
|
||||
{
|
||||
char * cmd = NULL;
|
||||
char **argv = NULL;
|
||||
char * base_fname = NULL;
|
||||
char * dir_fname = NULL;
|
||||
int status;
|
||||
bool is_dir;
|
||||
|
||||
is_dir = is_directory(fname);
|
||||
|
||||
*child_pid = fork();
|
||||
if( *child_pid == 0 ) { /* Child */
|
||||
|
||||
dir_fname = opal_dirname(fname);
|
||||
base_fname = opal_basename(fname);
|
||||
|
||||
chdir(dir_fname);
|
||||
|
||||
if( is_dir ) {
|
||||
#if 0
|
||||
opal_compress_base_tar_create(&base_fname);
|
||||
asprintf(cname, "%s.gz", base_fname);
|
||||
asprintf(&cmd, "gzip %s", base_fname);
|
||||
#else
|
||||
asprintf(cname, "%s.tar.gz", base_fname);
|
||||
asprintf(&cmd, "tar -zcf %s %s", *cname, base_fname);
|
||||
#endif
|
||||
} else {
|
||||
asprintf(cname, "%s.gz", base_fname);
|
||||
asprintf(&cmd, "gzip %s", base_fname);
|
||||
}
|
||||
|
||||
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||
"compress:gzip: compress_nb(%s -> [%s])",
|
||||
fname, *cname);
|
||||
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||
"compress:gzip: compress_nb() command [%s]",
|
||||
cmd);
|
||||
|
||||
argv = opal_argv_split(cmd, ' ');
|
||||
status = execvp(argv[0], argv);
|
||||
|
||||
opal_output(0, "compress:gzip: compress_nb: Failed to exec child [%s] status = %d\n", cmd, status);
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
else if( *child_pid > 0 ) {
|
||||
if( is_dir ) {
|
||||
*postfix = strdup(".tar.gz");
|
||||
} else {
|
||||
*postfix = strdup(".gz");
|
||||
}
|
||||
asprintf(cname, "%s%s", fname, *postfix);
|
||||
|
||||
}
|
||||
else {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
|
||||
if( NULL != cmd ) {
|
||||
free(cmd);
|
||||
cmd = NULL;
|
||||
}
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_compress_gzip_decompress(char * cname, char **fname)
|
||||
{
|
||||
pid_t child_pid = 0;
|
||||
int status = 0;
|
||||
|
||||
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||
"compress:gzip: decompress(%s)",
|
||||
cname);
|
||||
|
||||
opal_compress_gzip_decompress_nb(cname, fname, &child_pid);
|
||||
waitpid(child_pid, &status, 0);
|
||||
|
||||
if( WIFEXITED(status) && 0 == WEXITSTATUS(status) ) {
|
||||
return OPAL_SUCCESS;
|
||||
} else {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
}
|
||||
|
||||
int opal_compress_gzip_decompress_nb(char * cname, char **fname, pid_t *child_pid)
|
||||
{
|
||||
char * cmd = NULL;
|
||||
char **argv = NULL;
|
||||
char * dir_cname = NULL;
|
||||
pid_t loc_pid = 0;
|
||||
int status;
|
||||
bool is_tar = false;
|
||||
|
||||
if( 0 == strncmp(&(cname[strlen(cname)-7]), ".tar.gz", strlen(".tar.gz")) ) {
|
||||
is_tar = true;
|
||||
}
|
||||
|
||||
*fname = strdup(cname);
|
||||
if( is_tar ) {
|
||||
/* Strip off '.tar.gz' */
|
||||
(*fname)[strlen(cname)-7] = '\0';
|
||||
} else {
|
||||
/* Strip off '.gz' */
|
||||
(*fname)[strlen(cname)-3] = '\0';
|
||||
}
|
||||
|
||||
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||
"compress:gzip: decompress_nb(%s -> [%s])",
|
||||
cname, *fname);
|
||||
|
||||
*child_pid = fork();
|
||||
if( *child_pid == 0 ) { /* Child */
|
||||
dir_cname = opal_dirname(cname);
|
||||
|
||||
chdir(dir_cname);
|
||||
|
||||
/* Fork(gunzip) */
|
||||
loc_pid = fork();
|
||||
if( loc_pid == 0 ) { /* Child */
|
||||
asprintf(&cmd, "gunzip %s", cname);
|
||||
|
||||
opal_output_verbose(10, mca_compress_gzip_component.super.output_handle,
|
||||
"compress:gzip: decompress_nb() command [%s]",
|
||||
cmd);
|
||||
|
||||
argv = opal_argv_split(cmd, ' ');
|
||||
status = execvp(argv[0], argv);
|
||||
|
||||
opal_output(0, "compress:gzip: decompress_nb: Failed to exec child [%s] status = %d\n", cmd, status);
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
else if( loc_pid > 0 ) { /* Parent */
|
||||
waitpid(loc_pid, &status, 0);
|
||||
if( !WIFEXITED(status) ) {
|
||||
opal_output(0, "compress:gzip: decompress_nb: Failed to gunzip the file [%s] status = %d\n", cname, status);
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
}
|
||||
else {
|
||||
exit(OPAL_ERROR);
|
||||
}
|
||||
|
||||
/* tar_decompress */
|
||||
if( is_tar ) {
|
||||
/* Strip off '.gz' leaving just '.tar' */
|
||||
cname[strlen(cname)-3] = '\0';
|
||||
opal_compress_base_tar_extract(&cname);
|
||||
}
|
||||
|
||||
/* Once this child is done, then directly exit */
|
||||
exit(OPAL_SUCCESS);
|
||||
}
|
||||
else if( *child_pid > 0 ) {
|
||||
;
|
||||
}
|
||||
else {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
|
||||
if( NULL != cmd ) {
|
||||
free(cmd);
|
||||
cmd = NULL;
|
||||
}
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
static bool is_directory(char *fname ) {
|
||||
struct stat file_status;
|
||||
int rc;
|
||||
|
||||
if(0 != (rc = stat(fname, &file_status) ) ) {
|
||||
return false;
|
||||
}
|
||||
if(S_ISDIR(file_status.st_mode)) {
|
||||
return true;
|
||||
}
|
||||
|
||||
return false;
|
||||
}
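Review note: `decompress_nb` locates the suffix with `cname[strlen(cname)-7]`, which indexes before the start of the string when the name is shorter than the suffix. A bounds-checked variant could look like the sketch below. Both functions are hypothetical helpers written for this note, not part of the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* True if name ends with suffix; safe for names shorter than the suffix. */
static bool demo_has_suffix(const char *name, const char *suffix)
{
    size_t nlen = strlen(name), slen = strlen(suffix);
    return nlen >= slen && 0 == strcmp(name + nlen - slen, suffix);
}

/* Return a newly allocated copy of cname without its ".tar.gz" or ".gz"
 * suffix and record in *is_tar which one was found. Returns NULL if the
 * name carries neither suffix or if allocation fails. */
static char *demo_strip_gzip_suffix(const char *cname, bool *is_tar)
{
    size_t cut, keep;
    char *out;

    assert(NULL != cname && NULL != is_tar);

    *is_tar = demo_has_suffix(cname, ".tar.gz");
    if (*is_tar) {
        cut = strlen(".tar.gz");
    } else if (demo_has_suffix(cname, ".gz")) {
        cut = strlen(".gz");
    } else {
        return NULL;
    }

    keep = strlen(cname) - cut;
    out = malloc(keep + 1);
    if (NULL == out) {
        return NULL;
    }
    memcpy(out, cname, keep);
    out[keep] = '\0';
    return out;
}
```

Checking the longer suffix first matters, since every name ending in `.tar.gz` also ends in `.gz`.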
|
13
opal/mca/compress/gzip/configure.params
Normal file
@@ -0,0 +1,13 @@
# -*- shell-script -*-
#
# Copyright (c) 2004-2010 The Trustees of Indiana University.
#                         All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

PARAM_INIT_FILE=compress_gzip_component.c
PARAM_CONFIG_FILES="Makefile"
13
opal/mca/compress/gzip/help-opal-compress-gzip.txt
Normal file
@@ -0,0 +1,13 @@
-*- text -*-
#
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation.  All rights reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English general help file for Open PAL Compress framework.
#
@@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
* University Research and Technology
|
||||
* Corporation. All rights reserved.
|
||||
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||
@@ -23,6 +23,7 @@
|
||||
#include "opal_config.h"
|
||||
#include "opal/mca/crs/crs.h"
|
||||
#include "opal/util/opal_environ.h"
|
||||
#include "opal/runtime/opal_cr.h"
|
||||
|
||||
/*
|
||||
* Global functions for MCA overall CRS
|
||||
@@ -32,7 +33,7 @@ BEGIN_C_DECLS
|
||||
|
||||
/* Some local strings to use generically with the local metadata file */
|
||||
#define CRS_METADATA_BASE ("# ")
|
||||
#define CRS_METADATA_COMP ("# Component: ")
|
||||
#define CRS_METADATA_COMP ("# OPAL CRS Component: ")
|
||||
#define CRS_METADATA_PID ("# PID: ")
|
||||
#define CRS_METADATA_CONTEXT ("# CONTEXT: ")
|
||||
#define CRS_METADATA_MKDIR ("# MKDIR: ")
|
||||
@@ -71,35 +72,25 @@ BEGIN_C_DECLS
|
||||
/**
|
||||
* Globals
|
||||
*/
|
||||
#define opal_crs_base_metadata_filename (strdup("snapshot_meta.data"))
|
||||
|
||||
OPAL_DECLSPEC extern int opal_crs_base_output;
|
||||
OPAL_DECLSPEC extern opal_list_t opal_crs_base_components_available;
|
||||
OPAL_DECLSPEC extern opal_crs_base_component_t opal_crs_base_selected_component;
|
||||
OPAL_DECLSPEC extern opal_crs_base_module_t opal_crs;
|
||||
OPAL_DECLSPEC extern char * opal_crs_base_snapshot_dir;
|
||||
|
||||
/**
|
||||
* Some utility functions
|
||||
*/
|
||||
OPAL_DECLSPEC char * opal_crs_base_state_str(opal_crs_state_type_t state);
|
||||
|
||||
OPAL_DECLSPEC char * opal_crs_base_unique_snapshot_name(pid_t pid);
|
||||
OPAL_DECLSPEC int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** component_name, int *prev_pid);
|
||||
OPAL_DECLSPEC int opal_crs_base_init_snapshot_directory(opal_crs_base_snapshot_t *snapshot);
|
||||
OPAL_DECLSPEC char * opal_crs_base_get_snapshot_directory(char *uniq_snapshot_name);
|
||||
/*
|
||||
* Extract the expected component and pid from the metadata
|
||||
*/
|
||||
OPAL_DECLSPEC int opal_crs_base_extract_expected_component(FILE *metadata, char ** component_name, int *prev_pid);
|
||||
|
||||
/*
|
||||
* Read a token from the metadata file
|
||||
* NULL can be passed for snapshot_loc if init_snapshot_directory has been called.
|
||||
*/
|
||||
OPAL_DECLSPEC int opal_crs_base_metadata_read_token(char *snapshot_loc, char * token, char ***value);
|
||||
|
||||
/*
|
||||
* Write a token to the metadata file
|
||||
* NULL can be passed for snapshot_loc if init_snapshot_directory has been called.
|
||||
*/
|
||||
OPAL_DECLSPEC int opal_crs_base_metadata_write_token(char *snapshot_loc, char * token, char *value);
|
||||
OPAL_DECLSPEC int opal_crs_base_metadata_read_token(FILE *metadata, char * token, char ***value);
|
||||
|
||||
/*
|
||||
* Register a file for cleanup.
|
||||
@ -122,6 +113,24 @@ BEGIN_C_DECLS
|
||||
*/
|
||||
OPAL_DECLSPEC int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target);
|
||||
|
||||
/*
|
||||
* CRS self application interface functions
|
||||
*/
|
||||
typedef int (*opal_crs_base_self_checkpoint_fn_t)(char **restart_cmd);
|
||||
typedef int (*opal_crs_base_self_restart_fn_t)(void);
|
||||
typedef int (*opal_crs_base_self_continue_fn_t)(void);
|
||||
|
||||
extern opal_crs_base_self_checkpoint_fn_t crs_base_self_checkpoint_fn;
|
||||
extern opal_crs_base_self_restart_fn_t crs_base_self_restart_fn;
|
||||
extern opal_crs_base_self_continue_fn_t crs_base_self_continue_fn;
|
||||
|
||||
OPAL_DECLSPEC int opal_crs_base_self_register_checkpoint_callback
|
||||
(opal_crs_base_self_checkpoint_fn_t function);
|
||||
OPAL_DECLSPEC int opal_crs_base_self_register_restart_callback
|
||||
(opal_crs_base_self_restart_fn_t function);
|
||||
OPAL_DECLSPEC int opal_crs_base_self_register_continue_callback
|
||||
(opal_crs_base_self_continue_fn_t function);
|
||||
|
||||
END_C_DECLS
|
||||
|
||||
#endif /* OPAL_CRS_BASE_H */
|
||||
|
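The self-callback interface added to base.h above stores one function pointer per event (checkpoint, restart, continue). The following is a minimal sketch of how an application might use such an interface, not the Open MPI implementation. The typedef signatures are taken from the diff. Everything prefixed `sketch_` or `app_`, the dispatch helpers, and the behaviour of returning 0 when no callback is registered are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Callback signatures, copied from the typedefs added to base.h. */
typedef int (*self_checkpoint_fn_t)(char **restart_cmd);
typedef int (*self_restart_fn_t)(void);
typedef int (*self_continue_fn_t)(void);

/* Hypothetical stand-ins for the crs_base_self_*_fn globals. */
static self_checkpoint_fn_t sketch_checkpoint_fn = NULL;
static self_restart_fn_t    sketch_restart_fn    = NULL;
static self_continue_fn_t   sketch_continue_fn   = NULL;

/* Registration only stores the pointer, as the functions in the diff do. */
static int sketch_register_checkpoint(self_checkpoint_fn_t fn)
{
    sketch_checkpoint_fn = fn;
    return 0;
}

static int sketch_register_restart(self_restart_fn_t fn)
{
    sketch_restart_fn = fn;
    return 0;
}

static int sketch_register_continue(self_continue_fn_t fn)
{
    sketch_continue_fn = fn;
    return 0;
}

/* Assumed dispatch helpers: do nothing and return 0 if nothing is registered. */
static int sketch_run_checkpoint(char **restart_cmd)
{
    return (NULL == sketch_checkpoint_fn) ? 0 : sketch_checkpoint_fn(restart_cmd);
}

static int sketch_run_restart(void)
{
    return (NULL == sketch_restart_fn) ? 0 : sketch_restart_fn();
}

static int sketch_run_continue(void)
{
    return (NULL == sketch_continue_fn) ? 0 : sketch_continue_fn();
}

/* Hypothetical application callbacks. */
static char app_restart_cmd[] = "./app --resume";
static int  app_calls = 0;

static int app_checkpoint(char **restart_cmd)
{
    *restart_cmd = app_restart_cmd;   /* tell the caller how to restart us */
    app_calls++;
    return 0;
}

static int app_restart(void)  { app_calls++; return 0; }
static int app_continue(void) { app_calls++; return 0; }

/* Walk-through: dispatch before and after registration. */
static void self_cb_demo(void)
{
    char *cmd = NULL;

    assert(0 == sketch_run_checkpoint(&cmd));   /* nothing registered yet */
    assert(NULL == cmd);

    sketch_register_checkpoint(app_checkpoint);
    sketch_register_restart(app_restart);
    sketch_register_continue(app_continue);

    assert(0 == sketch_run_checkpoint(&cmd));
    assert(cmd == app_restart_cmd);
    assert(0 == sketch_run_continue());
    assert(0 == sketch_run_restart());
}
```

Keeping registration as a plain pointer store means the application can register callbacks at any point before a checkpoint is requested. The cost is that the dispatcher has to tolerate a NULL pointer, which the helpers above handle explicitly.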
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2007 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||
* All rights reserved.
|
||||
@ -24,6 +24,12 @@
|
||||
|
||||
int opal_crs_base_close(void)
|
||||
{
|
||||
if( !opal_cr_is_enabled ) {
|
||||
opal_output_verbose(10, opal_crs_base_output,
|
||||
"crs:close: FT is not enabled, skipping!");
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
/* Call the component's finalize routine */
|
||||
if( NULL != opal_crs.crs_finalize ) {
|
||||
opal_crs.crs_finalize();
|
||||
|
@ -44,13 +44,15 @@
|
||||
#include "opal/mca/crs/crs.h"
|
||||
#include "opal/mca/crs/base/base.h"
|
||||
|
||||
opal_crs_base_self_checkpoint_fn_t crs_base_self_checkpoint_fn;
|
||||
opal_crs_base_self_restart_fn_t crs_base_self_restart_fn;
|
||||
opal_crs_base_self_continue_fn_t crs_base_self_continue_fn;
|
||||
|
||||
/******************
|
||||
* Local Functions
|
||||
******************/
|
||||
static int metadata_extract_next_token(FILE *file, char **token, char **value);
|
||||
static int opal_crs_base_metadata_open(FILE ** meta_data, char * location, char * mode);
|
||||
|
||||
static char *last_metadata_file = NULL;
|
||||
static char **cleanup_file_argv = NULL;
|
||||
static char **cleanup_dir_argv = NULL;
|
||||
|
||||
@ -59,30 +61,30 @@ static char **cleanup_dir_argv = NULL;
|
||||
******************/
|
||||
static void opal_crs_base_construct(opal_crs_base_snapshot_t *snapshot)
|
||||
{
|
||||
snapshot->component_name = NULL;
|
||||
snapshot->reference_name = opal_crs_base_unique_snapshot_name(getpid());
|
||||
snapshot->local_location = opal_crs_base_get_snapshot_directory(snapshot->reference_name);
|
||||
snapshot->remote_location = strdup(snapshot->local_location);
|
||||
snapshot->component_name = NULL;
|
||||
|
||||
snapshot->metadata_filename = NULL;
|
||||
snapshot->metadata = NULL;
|
||||
snapshot->snapshot_directory = NULL;
|
||||
|
||||
snapshot->cold_start = false;
|
||||
}
|
||||
|
||||
static void opal_crs_base_destruct( opal_crs_base_snapshot_t *snapshot)
|
||||
{
|
||||
if(NULL != snapshot->reference_name) {
|
||||
free(snapshot->reference_name);
|
||||
snapshot->reference_name = NULL;
|
||||
if(NULL != snapshot->metadata_filename ) {
|
||||
free(snapshot->metadata_filename);
|
||||
snapshot->metadata_filename = NULL;
|
||||
}
|
||||
if(NULL != snapshot->local_location) {
|
||||
free(snapshot->local_location);
|
||||
snapshot->local_location = NULL;
|
||||
|
||||
if(NULL != snapshot->metadata) {
|
||||
fclose(snapshot->metadata);
|
||||
snapshot->metadata = NULL;
|
||||
}
|
||||
if(NULL != snapshot->remote_location) {
|
||||
free(snapshot->remote_location);
|
||||
snapshot->remote_location = NULL;
|
||||
}
|
||||
if(NULL != snapshot->component_name) {
|
||||
free(snapshot->component_name);
|
||||
snapshot->component_name = NULL;
|
||||
|
||||
if(NULL != snapshot->snapshot_directory ) {
|
||||
free(snapshot->snapshot_directory);
|
||||
snapshot->snapshot_directory = NULL;
|
||||
}
|
||||
}
|
||||
|
||||
@ -107,43 +109,29 @@ OBJ_CLASS_INSTANCE(opal_crs_base_ckpt_options_t,
|
||||
/*
|
||||
* Utility functions
|
||||
*/
|
||||
char * opal_crs_base_unique_snapshot_name(pid_t pid)
|
||||
{
|
||||
char * loc_str = NULL;
|
||||
|
||||
asprintf(&loc_str, "opal_snapshot_%d.ckpt", pid);
|
||||
|
||||
return loc_str;
|
||||
}
|
||||
|
||||
int opal_crs_base_metadata_read_token(char *snapshot_loc, char * token, char ***value) {
|
||||
int ret, exit_status = OPAL_SUCCESS;
|
||||
FILE * meta_data = NULL;
|
||||
int opal_crs_base_metadata_read_token(FILE *metadata, char * token, char ***value) {
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
char * loc_token = NULL;
|
||||
char * loc_value = NULL;
|
||||
int argc = 0;
|
||||
|
||||
/* Dummy check */
|
||||
if( NULL == token ) {
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* Open the metadata file
|
||||
*/
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_open(&meta_data, snapshot_loc, "r")) ) {
|
||||
opal_output(opal_crs_base_output,
|
||||
"opal:crs:base: opal_crs_base_metadata_read_token: Error: Unable to open the metadata file\n");
|
||||
exit_status = ret;
|
||||
if( NULL == metadata ) {
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* Extract each token and make the records
|
||||
*/
|
||||
rewind(metadata);
|
||||
do {
|
||||
/* Get next token */
|
||||
if( OPAL_SUCCESS != metadata_extract_next_token(meta_data, &loc_token, &loc_value) ) {
|
||||
if( OPAL_SUCCESS != metadata_extract_next_token(metadata, &loc_token, &loc_value) ) {
|
||||
break;
|
||||
}
|
||||
|
||||
@ -151,54 +139,26 @@ int opal_crs_base_metadata_read_token(char *snapshot_loc, char * token, char ***
|
||||
if(0 == strncmp(token, loc_token, strlen(loc_token)) ) {
|
||||
opal_argv_append(&argc, value, loc_value);
|
||||
}
|
||||
} while(0 == feof(meta_data) );
|
||||
} while(0 == feof(metadata) );
|
||||
|
||||
cleanup:
|
||||
if(NULL != meta_data) {
|
||||
fclose(meta_data);
|
||||
meta_data = NULL;
|
||||
}
|
||||
|
||||
rewind(metadata);
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
int opal_crs_base_metadata_write_token(char *snapshot_loc, char * token, char *value) {
|
||||
int ret, exit_status = OPAL_SUCCESS;
|
||||
FILE * meta_data = NULL;
|
||||
|
||||
/* Dummy check */
|
||||
if( NULL == token || NULL == value) {
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* Open the metadata file
|
||||
*/
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_open(&meta_data, snapshot_loc, "a")) ) {
|
||||
opal_output(opal_crs_base_output,
|
||||
"opal:crs:base: opal_crs_base_metadata_write_token: Error: Unable to open the metadata file\n");
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
fprintf(meta_data, "%s%s\n", token, value);
|
||||
|
||||
cleanup:
|
||||
if(NULL != meta_data) {
|
||||
fclose(meta_data);
|
||||
meta_data = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** component_name, int *prev_pid)
|
||||
int opal_crs_base_extract_expected_component(FILE *metadata, char ** component_name, int *prev_pid)
|
||||
{
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
char **pid_argv = NULL;
|
||||
char **name_argv = NULL;
|
||||
|
||||
opal_crs_base_metadata_read_token(snapshot_loc, CRS_METADATA_PID, &pid_argv);
|
||||
/* Dummy check */
|
||||
if( NULL == metadata ) {
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
opal_crs_base_metadata_read_token(metadata, CRS_METADATA_PID, &pid_argv);
|
||||
if( NULL != pid_argv && NULL != pid_argv[0] ) {
|
||||
*prev_pid = atoi(pid_argv[0]);
|
||||
} else {
|
||||
@ -207,7 +167,7 @@ int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** compone
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
opal_crs_base_metadata_read_token(snapshot_loc, CRS_METADATA_COMP, &name_argv);
|
||||
opal_crs_base_metadata_read_token(metadata, CRS_METADATA_COMP, &name_argv);
|
||||
if( NULL != name_argv && NULL != name_argv[0] ) {
|
||||
*component_name = strdup(name_argv[0]);
|
||||
} else {
|
||||
@ -230,68 +190,6 @@ int opal_crs_base_extract_expected_component(char *snapshot_loc, char ** compone
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
char * opal_crs_base_get_snapshot_directory(char *uniq_snapshot_name)
|
||||
{
|
||||
char * dir_name = NULL;
|
||||
|
||||
asprintf(&dir_name, "%s/%s", opal_crs_base_snapshot_dir, uniq_snapshot_name);
|
||||
|
||||
return dir_name;
|
||||
}
|
||||
|
||||
int opal_crs_base_init_snapshot_directory(opal_crs_base_snapshot_t *snapshot)
|
||||
{
|
||||
int ret, exit_status = OPAL_SUCCESS;
|
||||
mode_t my_mode = S_IRWXU;
|
||||
char * pid_str = NULL;
|
||||
|
||||
/*
|
||||
* Make the snapshot directory from the uniq_snapshot_name
|
||||
*/
|
||||
if(OPAL_SUCCESS != (ret = opal_os_dirpath_create(snapshot->local_location, my_mode)) ) {
|
||||
opal_output(opal_crs_base_output,
|
||||
"opal:crs:base: init_snapshot_directory: Error: Unable to create directory (%s)\n",
|
||||
snapshot->local_location);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* Initialize the metadata file at the top of that directory.
|
||||
* Add 'BASE' and 'PID'
|
||||
*/
|
||||
if( NULL != last_metadata_file ) {
|
||||
free(last_metadata_file);
|
||||
last_metadata_file = NULL;
|
||||
}
|
||||
last_metadata_file = strdup(snapshot->local_location);
|
||||
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_BASE, "") ) ) {
|
||||
opal_output(opal_crs_base_output,
|
||||
"opal:crs:base: init_snapshot_directory: Error: Unable to write BASE to the file (%s/%s)\n",
|
||||
snapshot->local_location, opal_crs_base_metadata_filename);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
asprintf(&pid_str, "%d", getpid());
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_PID, pid_str) ) ) {
|
||||
opal_output(opal_crs_base_output,
|
||||
"opal:crs:base: init_snapshot_directory: Error: Unable to write PID (%s) to the file (%s/%s)\n",
|
||||
pid_str, snapshot->local_location, opal_crs_base_metadata_filename);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
if( NULL != pid_str) {
|
||||
free(pid_str);
|
||||
pid_str = NULL;
|
||||
}
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_crs_base_cleanup_append(char* filename, bool is_dir)
|
||||
{
|
||||
if( NULL == filename ) {
|
||||
@ -399,6 +297,14 @@ int opal_crs_base_copy_options(opal_crs_base_ckpt_options_t *from,
|
||||
to->term = from->term;
|
||||
to->stop = from->stop;
|
||||
|
||||
to->inc_prep_only = from->inc_prep_only;
|
||||
to->inc_recover_only = from->inc_recover_only;
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
to->attach_debugger = from->attach_debugger;
|
||||
to->detach_debugger = from->detach_debugger;
|
||||
#endif
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
@ -413,6 +319,32 @@ int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target)
|
||||
target->term = false;
|
||||
target->stop = false;
|
||||
|
||||
target->inc_prep_only = false;
|
||||
target->inc_recover_only = false;
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
target->attach_debugger = false;
|
||||
target->detach_debugger = false;
|
||||
#endif
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_crs_base_self_register_checkpoint_callback(opal_crs_base_self_checkpoint_fn_t function)
|
||||
{
|
||||
crs_base_self_checkpoint_fn = function;
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_crs_base_self_register_restart_callback(opal_crs_base_self_restart_fn_t function)
|
||||
{
|
||||
crs_base_self_restart_fn = function;
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_crs_base_self_register_continue_callback(opal_crs_base_self_continue_fn_t function)
|
||||
{
|
||||
crs_base_self_continue_fn = function;
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
@ -420,38 +352,6 @@ int opal_crs_base_clear_options(opal_crs_base_ckpt_options_t *target)
|
||||
/******************
|
||||
* Local Functions
|
||||
******************/
|
||||
static int opal_crs_base_metadata_open(FILE **meta_data, char * location, char * mode)
|
||||
{
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
char * dir_name = NULL;
|
||||
|
||||
if( NULL == location ) {
|
||||
if( NULL == last_metadata_file ) {
|
||||
opal_output(0, "Error: No metadata filename specified!");
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
} else {
|
||||
location = last_metadata_file;
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* Find the snapshot directory, read the metadata file
|
||||
*/
|
||||
asprintf(&dir_name, "%s/%s", location, opal_crs_base_metadata_filename);
|
||||
if (NULL == (*meta_data = fopen(dir_name, mode)) ) {
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
if( NULL != dir_name ) {
|
||||
free(dir_name);
|
||||
dir_name = NULL;
|
||||
}
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
static int metadata_extract_next_token(FILE *file, char **token, char **value)
|
||||
{
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
@ -558,12 +458,20 @@ static int metadata_extract_next_token(FILE *file, char **token, char **value)
|
||||
*value = strdup(local_value);
|
||||
|
||||
cleanup:
|
||||
if( NULL != local_token)
|
||||
if( NULL != local_token) {
|
||||
free(local_token);
|
||||
if( NULL != local_value)
|
||||
local_token = NULL;
|
||||
}
|
||||
|
||||
if( NULL != local_value) {
|
||||
free(local_value);
|
||||
if( NULL != line)
|
||||
local_value = NULL;
|
||||
}
|
||||
|
||||
if( NULL != line) {
|
||||
free(line);
|
||||
line = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
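In the diff above, `opal_crs_base_metadata_read_token` changes from opening the metadata file by path on every call to scanning a caller-owned `FILE *`, rewinding it before and after the scan. The sketch below illustrates that rewind-scan-rewind pattern using only standard C. It is not the Open MPI code: `sketch_read_token`, its fixed-size output buffer, the line-prefix matching, and the "last match wins" behaviour are simplifying assumptions. The token strings mirror the `CRS_METADATA_*` macros shown in base.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find lines that start with `token` and copy the text
 * after the token into `out` (the last match wins).
 * Returns the number of matching lines, or -1 on bad arguments.
 */
static int sketch_read_token(FILE *metadata, const char *token,
                             char *out, size_t out_len)
{
    char line[256];
    size_t token_len;
    int matches = 0;

    if (NULL == metadata || NULL == token || NULL == out || 0 == out_len) {
        return -1;
    }
    token_len = strlen(token);

    rewind(metadata);                        /* always scan from the top */
    while (NULL != fgets(line, sizeof(line), metadata)) {
        if (0 == strncmp(line, token, token_len)) {
            line[strcspn(line, "\n")] = '\0';            /* drop newline */
            snprintf(out, out_len, "%s", line + token_len);
            matches++;
        }
    }
    rewind(metadata);                        /* leave the stream reusable */

    return matches;
}

/* Walk-through on a temporary file holding a few metadata lines. */
static void metadata_demo(void)
{
    char value[64] = "";
    FILE *f = tmpfile();

    assert(NULL != f);
    fputs("# \n# PID: 4242\n# OPAL CRS Component: blcr\n", f);

    assert(1 == sketch_read_token(f, "# PID: ", value, sizeof(value)));
    assert(0 == strcmp(value, "4242"));

    assert(1 == sketch_read_token(f, "# OPAL CRS Component: ", value, sizeof(value)));
    assert(0 == strcmp(value, "blcr"));

    assert(0 == sketch_read_token(f, "# CONTEXT: ", value, sizeof(value)));
    fclose(f);
}
```

Because the helper rewinds on entry and on exit, several tokens can be looked up back to back on the same open stream, which is how `opal_crs_base_extract_expected_component` uses the real function for the PID and component tokens.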
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2008 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||
* All rights reserved.
|
||||
@ -48,7 +48,6 @@ opal_crs_base_module_t opal_crs = {
|
||||
};
|
||||
opal_list_t opal_crs_base_components_available;
|
||||
opal_crs_base_component_t opal_crs_base_selected_component;
|
||||
char * opal_crs_base_snapshot_dir = NULL;
|
||||
|
||||
/**
|
||||
* Function for finding and opening either all MCA components,
|
||||
@ -73,14 +72,6 @@ int opal_crs_base_open(void)
|
||||
}
|
||||
opal_output_set_verbosity(opal_crs_base_output, value);
|
||||
|
||||
/* Base snapshot directory */
|
||||
mca_base_param_reg_string_name("crs",
|
||||
"base_snapshot_dir",
|
||||
"The base directory to use when storing snapshots",
|
||||
true, false,
|
||||
strdup("/tmp"),
|
||||
&opal_crs_base_snapshot_dir);
|
||||
|
||||
/*
|
||||
* Which CRS component to open
|
||||
* - NULL or "" = auto-select
|
||||
@ -90,7 +81,13 @@ int opal_crs_base_open(void)
|
||||
mca_base_param_reg_string_name("crs", NULL,
|
||||
"Which CRS component to use (empty = auto-select)",
|
||||
false, false,
|
||||
"none", &str_value);
|
||||
NULL, &str_value);
|
||||
|
||||
if( !opal_cr_is_enabled ) {
|
||||
opal_output_verbose(10, opal_crs_base_output,
|
||||
"crs:open: FT is not enabled, skipping!");
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
/* Open up all available components */
|
||||
if (OPAL_SUCCESS != (ret = mca_base_components_open("crs",
|
||||
@ -110,5 +107,6 @@ int opal_crs_base_open(void)
|
||||
if( NULL != str_value ) {
|
||||
free(str_value);
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2008 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
* Copyright (c) 2004-2005 The Trustees of the University of Tennessee.
|
||||
* All rights reserved.
|
||||
@ -37,6 +37,12 @@ int opal_crs_base_select(void)
|
||||
opal_crs_base_module_t *best_module = NULL;
|
||||
int int_value = 0;
|
||||
|
||||
if( !opal_cr_is_enabled ) {
|
||||
opal_output_verbose(10, opal_crs_base_output,
|
||||
"crs:select: FT is not enabled, skipping!");
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
/*
|
||||
* Note: If we are a tool, then we will manually run the selection routine
|
||||
* for the checkpointer. The tool will set the MCA parameter
|
||||
|
@ -167,6 +167,14 @@ AC_DEFUN([MCA_crs_blcr_CONFIG],[
|
||||
[BLCRs cr_checkpoint_info.requester member availability])
|
||||
$1])
|
||||
|
||||
#
|
||||
# Require either a working cr_request_file() or cr_request_checkpoint() function
|
||||
#
|
||||
AS_IF([test "$crs_blcr_have_working_cr_request" = "0" -a "$crs_blcr_have_cr_request_checkpoint" = "0"],
|
||||
[$2
|
||||
check_crs_blcr_good="no"
|
||||
AC_MSG_WARN([The BLCR CRS component requires either the cr_request_checkpoint() or cr_request_file() functions])])
|
||||
|
||||
#
|
||||
# Reset the flags
|
||||
#
|
||||
|
@ -34,6 +34,7 @@
|
||||
|
||||
#include "opal/mca/base/mca_base_param.h"
|
||||
|
||||
#include "opal/threads/threads.h"
|
||||
#include "opal/threads/mutex.h"
|
||||
#include "opal/threads/condition.h"
|
||||
|
||||
@ -94,20 +95,26 @@ OBJ_CLASS_INSTANCE(opal_crs_blcr_snapshot_t,
|
||||
/******************
|
||||
* Local Functions
|
||||
******************/
|
||||
static int blcr_checkpoint_peer(pid_t pid, char * local_dir, char ** fname);
|
||||
static int blcr_get_checkpoint_filename(char **fname, pid_t pid);
|
||||
static int opal_crs_blcr_thread_callback(void *arg);
|
||||
static int opal_crs_blcr_signal_callback(void *arg);
|
||||
|
||||
static int opal_crs_blcr_checkpoint_cmd(pid_t pid, char * local_dir, char **fname, char **cmd);
|
||||
static int opal_crs_blcr_restart_cmd(char *fname, char **cmd);
|
||||
|
||||
static int blcr_update_snapshot_metadata(opal_crs_blcr_snapshot_t *snapshot);
|
||||
static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot);
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
static void MPIR_checkpoint_debugger_crs_hook(cr_hook_event_t event);
|
||||
#endif
|
||||
|
||||
/*************************
|
||||
* Local Global Variables
|
||||
*************************/
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
static opal_thread_t *checkpoint_thread_id = NULL;
|
||||
static bool blcr_crdebug_refreshed_env = false;
|
||||
#endif
|
||||
|
||||
static cr_client_id_t client_id;
|
||||
static cr_callback_id_t cr_thread_callback_id;
|
||||
static cr_callback_id_t cr_signal_callback_id;
|
||||
@ -127,8 +134,10 @@ void opal_crs_blcr_construct(opal_crs_blcr_snapshot_t *snapshot) {
|
||||
}
|
||||
|
||||
void opal_crs_blcr_destruct( opal_crs_blcr_snapshot_t *snapshot) {
|
||||
if(NULL != snapshot->context_filename)
|
||||
if(NULL != snapshot->context_filename) {
|
||||
free(snapshot->context_filename);
|
||||
snapshot->context_filename = NULL;
|
||||
}
|
||||
}
|
||||
|
||||
/*****************
|
||||
@ -167,6 +176,10 @@ int opal_crs_blcr_module_init(void)
|
||||
}
|
||||
}
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
blcr_crdebug_refreshed_env = false;
|
||||
#endif
|
||||
|
||||
blcr_restart_cmd = strdup("cr_restart");
|
||||
blcr_checkpoint_cmd = strdup("cr_checkpoint");
|
||||
|
||||
@ -190,6 +203,20 @@ int opal_crs_blcr_module_init(void)
|
||||
cr_signal_callback_id = cr_register_callback(opal_crs_blcr_signal_callback,
|
||||
crs_blcr_signal_callback_arg,
|
||||
CR_SIGNAL_CONTEXT);
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
/*
|
||||
* Checkpoint/restart enabled debugging hooks
|
||||
* "NO_CALLBACKS" -> non-MPI threads
|
||||
* "SIGNAL_CONTEXT" -> MPI threads
|
||||
* "THREAD_CONTEXT" -> BLCR threads
|
||||
*/
|
||||
cr_register_hook(CR_HOOK_CONT_NO_CALLBACKS, MPIR_checkpoint_debugger_crs_hook);
|
||||
cr_register_hook(CR_HOOK_CONT_SIGNAL_CONTEXT, MPIR_checkpoint_debugger_crs_hook);
|
||||
|
||||
cr_register_hook(CR_HOOK_RSTRT_NO_CALLBACKS, MPIR_checkpoint_debugger_crs_hook);
|
||||
cr_register_hook(CR_HOOK_RSTRT_SIGNAL_CONTEXT, MPIR_checkpoint_debugger_crs_hook);
|
||||
#endif
|
||||
}
|
||||
|
||||
/*
|
||||
@ -262,6 +289,17 @@ int opal_crs_blcr_module_finalize(void)
|
||||
cr_replace_callback(cr_thread_callback_id, NULL, NULL, CR_THREAD_CONTEXT);
|
||||
/* Unload the signal callback */
|
||||
cr_replace_callback(cr_signal_callback_id, NULL, NULL, CR_SIGNAL_CONTEXT);
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
/*
|
||||
* Checkpoint/restart enabled debugging hooks
|
||||
*/
|
||||
cr_register_hook(CR_HOOK_CONT_NO_CALLBACKS, NULL);
|
||||
cr_register_hook(CR_HOOK_CONT_SIGNAL_CONTEXT, NULL);
|
||||
|
||||
cr_register_hook(CR_HOOK_RSTRT_NO_CALLBACKS, NULL);
|
||||
cr_register_hook(CR_HOOK_RSTRT_SIGNAL_CONTEXT, NULL);
|
||||
#endif
|
||||
}
|
||||
|
||||
/* BLCR does not have a finalization routine */
|
||||
@ -275,175 +313,158 @@ int opal_crs_blcr_checkpoint(pid_t pid,
|
||||
opal_crs_state_type_t *state)
|
||||
{
|
||||
int ret, exit_status = OPAL_SUCCESS;
|
||||
opal_crs_blcr_snapshot_t *snapshot = OBJ_NEW(opal_crs_blcr_snapshot_t);
|
||||
opal_crs_blcr_snapshot_t *snapshot = NULL;
|
||||
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1
|
||||
cr_checkpoint_args_t cr_args;
|
||||
static cr_checkpoint_handle_t cr_handle = (cr_checkpoint_handle_t)(-1);
|
||||
#endif
|
||||
int fd = 0;
|
||||
char *loc_fname = NULL;
|
||||
|
||||
if( pid != my_pid ) {
|
||||
opal_output(0, "crs:blcr: checkpoint(%d, ---): Checkpointing of peers not allowed!", pid);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(%d, ---)", pid);
|
||||
|
||||
if(NULL != snapshot->super.reference_name)
|
||||
free(snapshot->super.reference_name);
|
||||
snapshot->super.reference_name = strdup(base_snapshot->reference_name);
|
||||
|
||||
if(NULL != snapshot->super.local_location)
|
||||
free(snapshot->super.local_location);
|
||||
snapshot->super.local_location = strdup(base_snapshot->local_location);
|
||||
|
||||
if(NULL != snapshot->super.remote_location)
|
||||
free(snapshot->super.remote_location);
|
||||
snapshot->super.remote_location = strdup(base_snapshot->remote_location);
|
||||
snapshot = (opal_crs_blcr_snapshot_t *)base_snapshot;
|
||||
|
||||
/*
|
||||
* Update the snapshot metadata
|
||||
*/
|
||||
snapshot->super.component_name = strdup(mca_crs_blcr_component.super.base_version.mca_component_name);
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_COMP, snapshot->super.component_name) ) ) {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to write component name to the directory for (%s).",
|
||||
snapshot->super.reference_name);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
blcr_get_checkpoint_filename(&(snapshot->context_filename), pid);
|
||||
|
||||
if( NULL == snapshot->super.metadata ) {
|
||||
if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "a")) ) {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to open the file (%s)",
|
||||
snapshot->super.metadata_filename);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
}
|
||||
fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_COMP, snapshot->super.component_name);
|
||||
fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_CONTEXT, snapshot->context_filename);
|
||||
|
||||
fclose(snapshot->super.metadata );
|
||||
snapshot->super.metadata = NULL;
|
||||
|
||||
/*
|
||||
* If we can checkpoint ourselves, do so:
|
||||
* use cr_request_checkpoint() if available, and cr_request_file() if not
|
||||
*/
|
||||
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1 || CRS_BLCR_HAVE_CR_REQUEST == 1
|
||||
if( pid == my_pid ) {
|
||||
char *loc_fname = NULL;
|
||||
if( opal_crs_blcr_dev_null ) {
|
||||
loc_fname = strdup("/dev/null");
|
||||
} else {
|
||||
asprintf(&loc_fname, "%s/%s", snapshot->super.snapshot_directory, snapshot->context_filename);
|
||||
}
|
||||
|
||||
blcr_get_checkpoint_filename(&(snapshot->context_filename), pid);
|
||||
if( opal_crs_blcr_dev_null ) {
|
||||
loc_fname = strdup("/dev/null");
|
||||
} else {
|
||||
asprintf(&loc_fname, "%s/%s", snapshot->super.local_location, snapshot->context_filename);
|
||||
}
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
/* Make sure to identify the checkpointing thread, so that it is not
|
||||
* prevented from requesting the checkpoint after the debugger detaches
|
||||
*/
|
||||
opal_cr_debug_set_current_ckpt_thread_self();
|
||||
checkpoint_thread_id = opal_thread_get_self();
|
||||
blcr_crdebug_refreshed_env = false;
|
||||
|
||||
/* If checkpoint/restart enabled debugging then mark detachment place */
|
||||
if( MPIR_debug_with_checkpoint ) {
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint SELF <%s>",
|
||||
loc_fname);
|
||||
"crs:blcr: checkpoint(): Detaching debugger...");
|
||||
MPIR_checkpoint_debugger_detach();
|
||||
}
|
||||
#endif
|
||||
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint SELF <%s>",
|
||||
loc_fname);
|
||||
|
||||
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1 || CRS_BLCR_HAVE_CR_REQUEST == 1
|
||||
#if CRS_BLCR_HAVE_CR_REQUEST_CHECKPOINT == 1
|
||||
{
|
||||
int fd = 0;
|
||||
fd = open(loc_fname,
|
||||
O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE,
|
||||
S_IRUSR | S_IWUSR);
|
||||
if( fd < 0 ) {
|
||||
fd = open(loc_fname,
|
||||
O_WRONLY | O_CREAT | O_TRUNC | O_LARGEFILE,
|
||||
S_IRUSR | S_IWUSR);
|
||||
if( fd < 0 ) {
|
||||
*state = OPAL_CRS_ERROR;
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to open checkpoint file (%s) for pid (%d)",
|
||||
loc_fname, pid);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cr_initialize_checkpoint_args_t(&cr_args);
|
||||
cr_args.cr_scope = CR_SCOPE_PROC;
|
||||
cr_args.cr_fd = fd;
|
||||
if( options->stop ) {
|
||||
cr_args.cr_signal = SIGSTOP;
|
||||
}
|
||||
|
||||
ret = cr_request_checkpoint(&cr_args, &cr_handle);
|
||||
if( ret < 0 ) {
|
||||
close(cr_args.cr_fd);
|
||||
*state = OPAL_CRS_ERROR;
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d) to file (%s)",
|
||||
pid, loc_fname);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/* Wait for checkpoint to finish */
|
||||
do {
|
||||
ret = cr_poll_checkpoint(&cr_handle, NULL);
|
||||
if( ret < 0 ) {
|
||||
/* Check if restarting. This is not an error. */
|
||||
if( (ret == CR_POLL_CHKPT_ERR_POST) && (errno == CR_ERESTARTED) ) {
|
||||
ret = 0;
|
||||
break;
|
||||
}
|
||||
/* If the call was interrupted by a signal, retry the call */
|
||||
else if (errno == EINTR) {
|
||||
;
|
||||
}
|
||||
/* Otherwise this is a real error that we need to deal with */
|
||||
else {
|
||||
*state = OPAL_CRS_ERROR;
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to open checkpoint file (%s) for pid (%d)",
|
||||
loc_fname, pid);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cr_initialize_checkpoint_args_t(&cr_args);
|
||||
cr_args.cr_scope = CR_SCOPE_PROC;
|
||||
cr_args.cr_fd = fd;
|
||||
if( options->stop ) {
|
||||
cr_args.cr_signal = SIGSTOP;
|
||||
}
|
||||
|
||||
ret = cr_request_checkpoint(&cr_args, &cr_handle);
|
||||
if( ret < 0 ) {
|
||||
close(cr_args.cr_fd);
|
||||
*state = OPAL_CRS_ERROR;
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d) to file (%s)",
|
||||
pid, loc_fname);
|
||||
"crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d) to file (%s) - poll failed with (%d)",
|
||||
pid, loc_fname, ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/* Wait for checkpoint to finish */
|
||||
do {
|
||||
ret = cr_poll_checkpoint(&cr_handle, NULL);
|
||||
if( ret < 0 ) {
|
||||
/* Check if restarting. This is not an error. */
|
||||
if( (ret == CR_POLL_CHKPT_ERR_POST) && (errno == CR_ERESTARTED) ) {
|
||||
ret = 0;
|
||||
break;
|
||||
}
|
||||
/* If the call was interrupted by a signal, retry the call */
|
||||
else if (errno == EINTR) {
|
||||
;
|
||||
}
|
||||
/* Otherwise this is a real error that we need to deal with */
|
||||
else {
|
||||
*state = OPAL_CRS_ERROR;
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d) to file (%s) - poll failed with (%d)",
|
||||
pid, loc_fname, ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
}
|
||||
} while( ret < 0 );
|
||||
|
||||
/* Close the file */
|
||||
close(cr_args.cr_fd);
|
||||
}
|
||||
} while( ret < 0 );
|
||||
|
||||
/* Close the file */
|
||||
close(cr_args.cr_fd);
|
||||
#else
|
||||
/* Request a checkpoint be taken of the current process.
|
||||
* Since we are not guaranteed to finish the checkpoint before this
|
||||
* returns, we also need to wait for it.
|
||||
*/
|
||||
cr_request_file(loc_fname);
|
||||
/* Request a checkpoint be taken of the current process.
|
||||
* Since we are not guaranteed to finish the checkpoint before this
|
||||
* returns, we also need to wait for it.
|
||||
*/
|
||||
cr_request_file(loc_fname);
|
||||
|
||||
/* Wait for checkpoint to finish */
|
||||
do {
|
||||
usleep(1000); /* JJH Do we really want to sleep? */
|
||||
} while(CR_STATE_IDLE != cr_status());
|
||||
/* Wait for checkpoint to finish */
|
||||
do {
|
||||
usleep(1000); /* JJH Do we really want to sleep? */
|
||||
} while(CR_STATE_IDLE != cr_status());
|
||||
#endif
|
||||
#endif
|
||||
|
||||
*state = blcr_current_state;
|
||||
free(loc_fname);
|
||||
}
|
||||
/*
|
||||
* Checkpointing another process
|
||||
*/
|
||||
else
|
||||
#endif
|
||||
{
|
||||
ret = blcr_checkpoint_peer(pid, snapshot->super.local_location, &(snapshot->context_filename));
|
||||
|
||||
if(OPAL_SUCCESS != ret) {
|
||||
*state = OPAL_CRS_ERROR;
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to checkpoint pid (%d)",
|
||||
pid);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
*state = blcr_current_state;
|
||||
}
|
||||
|
||||
if(*state == OPAL_CRS_CONTINUE) {
|
||||
/*
|
||||
* Update the metadata file
|
||||
*/
|
||||
if( OPAL_SUCCESS != (ret = blcr_update_snapshot_metadata(snapshot)) ) {
|
||||
*state = OPAL_CRS_ERROR;
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to update metadata for snapshot (%s).",
|
||||
snapshot->super.reference_name);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* Return to the caller
|
||||
*/
|
||||
base_snapshot = &(snapshot->super);
|
||||
*state = blcr_current_state;
|
||||
free(loc_fname);
|
||||
|
||||
cleanup:
|
||||
if( NULL != snapshot->super.metadata ) {
|
||||
fclose(snapshot->super.metadata );
|
||||
snapshot->super.metadata = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
@@ -459,7 +480,7 @@ int opal_crs_blcr_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
|
||||
snapshot->super = *base_snapshot;
|
||||
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: restart(%s, %d)", snapshot->super.reference_name, spawn_child);
|
||||
"crs:blcr: restart(--, %d)", spawn_child);
|
||||
|
||||
/*
|
||||
* If we need to reconstruct the snapshot,
|
||||
@@ -486,10 +507,6 @@ int opal_crs_blcr_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
* Restart by replacing this process
|
||||
*/
|
||||
/* Need to shutdown the event engine before this.
|
||||
* for some reason the BLCR checkpointer and our event engine don't get
|
||||
* along very well.
|
||||
@@ -586,94 +603,6 @@ int opal_crs_blcr_enable_checkpoint(void)
|
||||
/*****************************
|
||||
* Local Function Definitions
|
||||
*****************************/
|
||||
static int blcr_checkpoint_peer(pid_t pid, char * local_dir, char ** fname)
|
||||
{
|
||||
char **cr_argv = NULL;
|
||||
char *cr_cmd = NULL;
|
||||
int ret;
|
||||
pid_t child_pid;
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
int status, child_status;
|
||||
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint_peer(%d, --)", pid);
|
||||
|
||||
/*
|
||||
* Get the checkpoint command
|
||||
*/
|
||||
if ( OPAL_SUCCESS != (ret = opal_crs_blcr_checkpoint_cmd(pid, local_dir, fname, &cr_cmd)) ) {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint_peer: Failed to generate checkpoint command :(%d):", ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
if ( NULL == (cr_argv = opal_argv_split(cr_cmd, ' ')) ) {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint_peer: Failed to opal_argv_split :(%d):", ret);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* Fork a child to do the checkpoint
|
||||
*/
|
||||
blcr_current_state = OPAL_CRS_CHECKPOINT;
|
||||
|
||||
child_pid = fork();
|
||||
|
||||
if(0 == child_pid) {
|
||||
/* Child Process */
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: blcr_checkpoint_peer: exec :(%s, %s):",
|
||||
strdup(blcr_checkpoint_cmd),
|
||||
opal_argv_join(cr_argv, ' '));
|
||||
|
||||
status = execvp(strdup(blcr_checkpoint_cmd), cr_argv);
|
||||
|
||||
if(status < 0) {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: blcr_checkpoint_peer: Child failed to execute :(%d):", status);
|
||||
}
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: blcr_checkpoint_peer: execvp returned %d", status);
|
||||
}
|
||||
else if(child_pid > 0) {
|
||||
/* Don't waitpid here since we don't really want to restart from inside waitpid ;) */
|
||||
while(OPAL_CRS_RESTART != blcr_current_state &&
|
||||
OPAL_CRS_CONTINUE != blcr_current_state ) {
|
||||
OPAL_THREAD_LOCK(&blcr_lock);
|
||||
opal_condition_wait(&blcr_cond, &blcr_lock);
|
||||
OPAL_THREAD_UNLOCK(&blcr_lock);
|
||||
}
|
||||
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: blcr_checkpoint_peer: Thread finished with status %d", blcr_current_state);
|
||||
|
||||
if(OPAL_CRS_CONTINUE == blcr_current_state) {
|
||||
/* Wait for the child only if we are continuing */
|
||||
if( 0 > waitpid(child_pid, &child_status, 0) ) {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: blcr_checkpoint_peer: waitpid returned %d", child_status);
|
||||
}
|
||||
}
|
||||
}
|
||||
else {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: blcr_checkpoint_peer: fork failed :(%d):", child_pid);
|
||||
}
|
||||
|
||||
/*
|
||||
* Cleanup
|
||||
*/
|
||||
cleanup:
|
||||
if(NULL != cr_cmd)
|
||||
free(cr_cmd);
|
||||
if(NULL != cr_argv)
|
||||
opal_argv_free(cr_argv);
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
static int opal_crs_blcr_thread_callback(void *arg) {
|
||||
const struct cr_checkpoint_info *ckpt_info = cr_get_checkpoint_info();
|
||||
int ret;
|
||||
@@ -700,6 +629,11 @@ static int opal_crs_blcr_thread_callback(void *arg) {
|
||||
else
|
||||
#endif
|
||||
{
|
||||
if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_CRS_PRE_CKPT,
|
||||
OMPI_CR_INC_STATE_PREPARE)) ) {
|
||||
;
|
||||
}
|
||||
|
||||
ret = cr_checkpoint(0);
|
||||
}
|
||||
|
||||
@@ -720,6 +654,13 @@ static int opal_crs_blcr_thread_callback(void *arg) {
|
||||
blcr_current_state = OPAL_CRS_CONTINUE;
|
||||
}
|
||||
|
||||
if( OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_CRS_POST_CKPT,
|
||||
(blcr_current_state == OPAL_CRS_CONTINUE ?
|
||||
OMPI_CR_INC_STATE_CONTINUE :
|
||||
OMPI_CR_INC_STATE_RESTART))) ) {
|
||||
;
|
||||
}
|
||||
|
||||
OPAL_THREAD_UNLOCK(&blcr_lock);
|
||||
opal_condition_signal(&blcr_cond);
|
||||
|
||||
@@ -747,66 +688,6 @@ static int opal_crs_blcr_signal_callback(void *arg) {
|
||||
return 0;
|
||||
}
|
||||
|
||||
static int opal_crs_blcr_checkpoint_cmd(pid_t pid, char * local_dir, char **fname, char **cmd)
|
||||
{
|
||||
char **cr_argv = NULL;
|
||||
int argc = 0, ret;
|
||||
char * pid_str;
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
char * loc_fname = NULL;
|
||||
|
||||
blcr_get_checkpoint_filename(fname, pid);
|
||||
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint_cmd(%d)", pid);
|
||||
|
||||
asprintf(&loc_fname, "%s/%s", local_dir, *fname);
|
||||
|
||||
/*
|
||||
* Build the command
|
||||
*/
|
||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup(blcr_checkpoint_cmd)))) {
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup("--pid")))) {
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
asprintf(&pid_str, "%d", pid);
|
||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup(pid_str)))) {
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup("--file")))) {
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (OPAL_SUCCESS != (ret = opal_argv_append(&argc, &cr_argv, strdup(loc_fname)))) {
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
if(exit_status != OPAL_SUCCESS)
|
||||
*cmd = NULL;
|
||||
else
|
||||
*cmd = opal_argv_join(cr_argv, ' ');
|
||||
|
||||
if(NULL != pid_str)
|
||||
free(pid_str);
|
||||
if( NULL != cr_argv)
|
||||
opal_argv_free(cr_argv);
|
||||
if(NULL != loc_fname)
|
||||
free(loc_fname);
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
static int opal_crs_blcr_restart_cmd(char *fname, char **cmd)
|
||||
{
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
@@ -833,32 +714,6 @@ static int blcr_get_checkpoint_filename(char **fname, pid_t pid)
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
static int blcr_update_snapshot_metadata(opal_crs_blcr_snapshot_t *snapshot) {
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: update_snapshot_metadata(%s)", snapshot->super.reference_name);
|
||||
|
||||
/* Bozo check to make sure this snapshot is ours */
|
||||
if ( 0 != strncmp(mca_crs_blcr_component.super.base_version.mca_component_name,
|
||||
snapshot->super.component_name,
|
||||
strlen(snapshot->super.component_name)) ) {
|
||||
exit_status = OPAL_ERROR;
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: blcr_update_snapshot_metadata: Error: This snapshot (%s) is not intended for us (%s)\n",
|
||||
snapshot->super.component_name, mca_crs_blcr_component.super.base_version.mca_component_name);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* Append to the metadata file the context filename
|
||||
*/
|
||||
opal_crs_base_metadata_write_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, snapshot->context_filename);
|
||||
|
||||
cleanup:
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
|
||||
int ret, exit_status = OPAL_SUCCESS;
|
||||
char **tmp_argv = NULL;
|
||||
@@ -866,16 +721,25 @@ static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
|
||||
int prev_pid;
|
||||
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: cold_start(%s)", snapshot->super.reference_name);
|
||||
"crs:blcr: cold_start()");
|
||||
|
||||
/*
|
||||
* Find the snapshot directory, read the metadata file
|
||||
*/
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.local_location,
|
||||
if( NULL == snapshot->super.metadata ) {
|
||||
if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "r")) ) {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: checkpoint(): Error: Unable to open the file (%s)",
|
||||
snapshot->super.metadata_filename);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
}
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.metadata,
|
||||
&component_name, &prev_pid) ) ) {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: blcr_cold_start: Error: Failed to extract the metadata from the local snapshot (%s). Returned %d.",
|
||||
snapshot->super.local_location, ret);
|
||||
snapshot->super.metadata_filename, ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
@@ -895,15 +759,15 @@ static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
|
||||
/*
|
||||
* Context Filename
|
||||
*/
|
||||
opal_crs_base_metadata_read_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, &tmp_argv);
|
||||
opal_crs_base_metadata_read_token(snapshot->super.metadata, CRS_METADATA_CONTEXT, &tmp_argv);
|
||||
if( NULL == tmp_argv ) {
|
||||
opal_output(mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: blcr_cold_start: Error: Failed to read the %s token from the local checkpoint in %s",
|
||||
CRS_METADATA_CONTEXT, snapshot->super.local_location);
|
||||
CRS_METADATA_CONTEXT, snapshot->super.snapshot_directory);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
asprintf(&snapshot->context_filename, "%s/%s", snapshot->super.local_location, tmp_argv[0]);
|
||||
asprintf(&snapshot->context_filename, "%s/%s", snapshot->super.snapshot_directory, tmp_argv[0]);
|
||||
|
||||
/*
|
||||
* Reset the cold_start flag
|
||||
@@ -916,5 +780,75 @@ static int blcr_cold_start(opal_crs_blcr_snapshot_t *snapshot) {
|
||||
tmp_argv = NULL;
|
||||
}
|
||||
|
||||
if( NULL != snapshot->super.metadata ) {
|
||||
fclose(snapshot->super.metadata);
|
||||
snapshot->super.metadata = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
static void MPIR_checkpoint_debugger_crs_hook(cr_hook_event_t event) {
|
||||
opal_thread_t *my_thread_id = NULL;
|
||||
my_thread_id = opal_thread_get_self();
|
||||
|
||||
/* Non-MPI threads */
|
||||
if(event == CR_HOOK_RSTRT_NO_CALLBACKS ) {
|
||||
/* wait for the MPI thread to refresh the environment for us */
|
||||
while(!blcr_crdebug_refreshed_env) {
|
||||
sched_yield();
|
||||
}
|
||||
}
|
||||
/* MPI threads */
|
||||
else if(event == CR_HOOK_RSTRT_SIGNAL_CONTEXT ) {
|
||||
if( opal_thread_self_compare(checkpoint_thread_id) ) {
|
||||
opal_cr_refresh_environ(my_pid);
|
||||
blcr_crdebug_refreshed_env = true;
|
||||
} else {
|
||||
while(!blcr_crdebug_refreshed_env) {
|
||||
sched_yield();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* Some debugging output
|
||||
*/
|
||||
/* Non-MPI threads */
|
||||
if( event == CR_HOOK_CONT_NO_CALLBACKS ) {
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Continue (Non-MPI). (%d)",
|
||||
(int)my_thread_id->t_handle);
|
||||
}
|
||||
else if(event == CR_HOOK_RSTRT_NO_CALLBACKS ) {
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Restart (Non-MPI). (%d)",
|
||||
(int)my_thread_id->t_handle);
|
||||
}
|
||||
/* MPI Threads */
|
||||
else if( event == CR_HOOK_CONT_SIGNAL_CONTEXT ) {
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Continue (MPI).");
|
||||
}
|
||||
else if(event == CR_HOOK_RSTRT_SIGNAL_CONTEXT ) {
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Waiting in Restart (MPI).");
|
||||
}
|
||||
|
||||
/*
|
||||
* Enter the breakpoint function.
|
||||
* If no debugger intends to attach, then this function is expected to
|
||||
* return immediately.
|
||||
*
|
||||
* If this is an MPI thread then odds are that this is the checkpointing
|
||||
* thread, in which case this function will return immediately allowing
|
||||
* it to prepare the MPI library before signaling to the debugger that
|
||||
* it is safe to attach, if necessary.
|
||||
*/
|
||||
MPIR_checkpoint_debugger_waitpoint();
|
||||
|
||||
opal_output_verbose(10, mca_crs_blcr_component.super.output_handle,
|
||||
"crs:blcr: MPIR_checkpoint_debugger_crs_hook: Finished...");
|
||||
}
|
||||
#endif
|
||||
|
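The restart branch of MPIR_checkpoint_debugger_crs_hook above has every thread except the checkpoint thread spin on blcr_crdebug_refreshed_env, calling sched_yield() until the environment has been refreshed. The following is a minimal, self-contained sketch of that gate pattern, not the Open MPI code itself. It assumes POSIX threads, and it swaps the plain flag for a C11 atomic so the ordering is well defined. The names env_refreshed, refreshed_value, refresher, waiter, and run_gate_demo are hypothetical.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for blcr_crdebug_refreshed_env. */
static atomic_bool env_refreshed;
/* Hypothetical stand-in for the state that the refresh publishes. */
static int refreshed_value;

/* Plays the checkpoint thread: refresh first, then open the gate. */
static void *refresher(void *arg)
{
    (void)arg;
    refreshed_value = 42;
    atomic_store(&env_refreshed, true);
    return NULL;
}

/* Plays any other thread: yield the CPU until the gate opens. */
static void *waiter(void *arg)
{
    int *seen = (int *)arg;
    while (!atomic_load(&env_refreshed)) {
        sched_yield();
    }
    *seen = refreshed_value;
    return NULL;
}

/* Runs one waiter and one refresher. Returns the value the waiter
 * observed, or -1 if a thread could not be created. */
int run_gate_demo(void)
{
    pthread_t w, r;
    int seen = -1;

    atomic_store(&env_refreshed, false);
    refreshed_value = 0;

    if (0 != pthread_create(&w, NULL, waiter, &seen)) {
        return -1;
    }
    if (0 != pthread_create(&r, NULL, refresher, NULL)) {
        /* Open the gate so the waiter can exit before we bail out. */
        atomic_store(&env_refreshed, true);
        pthread_join(w, NULL);
        return -1;
    }
    pthread_join(r, NULL);
    pthread_join(w, NULL);
    return seen;
}
```

Because the waiter only reads refreshed_value after it has seen the flag set, it observes the refreshed state no matter which thread is scheduled first. Yielding keeps the wait cheap when the refresh is quick; a condition variable would be the usual alternative for longer waits.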
@@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2009 The Trustees of Indiana University and Indiana
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
* University Research and Technology
|
||||
* Corporation. All rights reserved.
|
||||
* Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||
@@ -79,6 +79,19 @@ struct opal_crs_base_ckpt_options_1_0_0_t {
|
||||
bool term;
|
||||
/** Send SIGSTOP after checkpoint */
|
||||
bool stop;
|
||||
|
||||
/** INC Prep Only */
|
||||
bool inc_prep_only;
|
||||
|
||||
/** INC Recover Only */
|
||||
bool inc_recover_only;
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
/** Wait for debugger to attach after checkpoint */
|
||||
bool attach_debugger;
|
||||
/** Do not wait for debugger to reattach after checkpoint */
|
||||
bool detach_debugger;
|
||||
#endif
|
||||
};
|
||||
typedef struct opal_crs_base_ckpt_options_1_0_0_t opal_crs_base_ckpt_options_1_0_0_t;
|
||||
typedef struct opal_crs_base_ckpt_options_1_0_0_t opal_crs_base_ckpt_options_t;
|
||||
@@ -96,12 +109,14 @@ struct opal_crs_base_snapshot_1_0_0_t {
|
||||
/** MCA Component name */
|
||||
char * component_name;
|
||||
|
||||
/** Unique name of snapshot */
|
||||
char * reference_name;
|
||||
/** Metadata filename */
|
||||
char * metadata_filename;
|
||||
|
||||
/** Metadata file handle */
|
||||
FILE * metadata;
|
||||
|
||||
/** Absolute path to the snapshot directory */
|
||||
char * local_location;
|
||||
char * remote_location;
|
||||
char * snapshot_directory;
|
||||
|
||||
/** Cold Start:
|
||||
* If we are restarting cold, then we need to recreate this structure
|
||||
|
@@ -1,5 +1,5 @@
|
||||
/*
|
||||
* Copyright (c) 2004-2009 The Trustees of Indiana University.
|
||||
* Copyright (c) 2004-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
@@ -58,25 +58,25 @@ int opal_crs_none_checkpoint(pid_t pid,
|
||||
opal_crs_base_ckpt_options_t *options,
|
||||
opal_crs_state_type_t *state)
|
||||
{
|
||||
int ret;
|
||||
|
||||
*state = OPAL_CRS_CONTINUE;
|
||||
|
||||
snapshot->component_name = strdup("none");
|
||||
snapshot->reference_name = strdup("none");
|
||||
snapshot->local_location = strdup("");
|
||||
snapshot->remote_location = strdup("");
|
||||
snapshot->cold_start = false;
|
||||
|
||||
/*
|
||||
* Update the snapshot metadata
|
||||
*/
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_COMP, "none") ) ) {
|
||||
opal_output(0,
|
||||
"crs:none: checkpoint(): Error: Unable to write component name to the directory for (%s).",
|
||||
snapshot->reference_name);
|
||||
return ret;
|
||||
if( NULL == snapshot->metadata ) {
|
||||
if (NULL == (snapshot->metadata = fopen(snapshot->metadata_filename, "a")) ) {
|
||||
opal_output(0,
|
||||
"crs:none: checkpoint(): Error: Unable to open the file (%s)",
|
||||
snapshot->metadata_filename);
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
}
|
||||
fprintf(snapshot->metadata, "%s%s\n", CRS_METADATA_COMP, snapshot->component_name);
|
||||
fclose(snapshot->metadata);
|
||||
snapshot->metadata = NULL;
|
||||
|
||||
if( options->stop ) {
|
||||
opal_output(0,
|
||||
@@ -88,28 +88,43 @@ int opal_crs_none_checkpoint(pid_t pid,
|
||||
|
||||
int opal_crs_none_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_child, pid_t *child_pid)
|
||||
{
|
||||
int exit_status = OPAL_SUCCESS;
|
||||
char **tmp_argv = NULL;
|
||||
char **cr_argv = NULL;
|
||||
int status;
|
||||
|
||||
*child_pid = getpid();
|
||||
|
||||
opal_crs_base_metadata_read_token(base_snapshot->local_location, CRS_METADATA_CONTEXT, &tmp_argv);
|
||||
if( NULL == base_snapshot->metadata ) {
|
||||
if (NULL == (base_snapshot->metadata = fopen(base_snapshot->metadata_filename, "r")) ) {
|
||||
opal_output(0,
|
||||
"crs:none: checkpoint(): Error: Unable to open the file (%s)",
|
||||
base_snapshot->metadata_filename);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
}
|
||||
|
||||
opal_crs_base_metadata_read_token(base_snapshot->metadata, CRS_METADATA_CONTEXT, &tmp_argv);
|
||||
|
||||
if( NULL == tmp_argv ) {
|
||||
opal_output(opal_crs_base_output,
|
||||
"crs:none: none_restart: Error: Failed to read the %s token from the local checkpoint in %s",
|
||||
CRS_METADATA_CONTEXT, base_snapshot->local_location);
|
||||
return OPAL_ERROR;
|
||||
CRS_METADATA_CONTEXT, base_snapshot->metadata_filename);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if( opal_argv_count(tmp_argv) <= 0 ) {
|
||||
opal_output_verbose(10, opal_crs_base_output,
|
||||
"crs:none: none_restart: No command line to exec, so just returning");
|
||||
return OPAL_SUCCESS;
|
||||
exit_status = OPAL_SUCCESS;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if ( NULL == (cr_argv = opal_argv_split(tmp_argv[0], ' ')) ) {
|
||||
return OPAL_ERROR;
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if( !spawn_child ) {
|
||||
@@ -126,14 +141,20 @@ int opal_crs_none_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
|
||||
}
|
||||
opal_output(opal_crs_base_output,
|
||||
"crs:none: none_restart: execvp returned %d", status);
|
||||
return status;
|
||||
exit_status = status;
|
||||
goto cleanup;
|
||||
} else {
|
||||
opal_output(opal_crs_base_output,
|
||||
"crs:none: none_restart: Spawn not implemented");
|
||||
return OPAL_ERR_NOT_IMPLEMENTED;
|
||||
exit_status = OPAL_ERR_NOT_IMPLEMENTED;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
cleanup:
|
||||
if( NULL != base_snapshot->metadata ) {
fclose(base_snapshot->metadata);
base_snapshot->metadata = NULL;
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
int opal_crs_none_disable_checkpoint(void)
|
||||
|
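With this change the none component writes its metadata as a plain `<token><value>` line through a FILE handle and reads tokens back from the same file on restart, funnelling every exit through a single cleanup label. The sketch below illustrates that round trip in isolation. It is an assumption-laden illustration rather than the OPAL implementation: DEMO_TOKEN_COMP, demo_metadata_write, and demo_metadata_read are hypothetical names, and the real value of CRS_METADATA_COMP is not shown in this diff.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token prefix; the real CRS_METADATA_COMP value is not shown here. */
#define DEMO_TOKEN_COMP "# COMPONENT: "

/* Append one "<token><value>\n" line to the metadata file.
 * Returns 0 on success, -1 on failure. */
int demo_metadata_write(const char *path, const char *token, const char *value)
{
    int exit_status = 0;
    FILE *fp = NULL;

    assert(NULL != path && NULL != token && NULL != value);

    fp = fopen(path, "a");            /* append mode: the stream is write-only */
    if (NULL == fp) {
        return -1;
    }
    if (fprintf(fp, "%s%s\n", token, value) < 0) {
        exit_status = -1;
    }
    if (0 != fclose(fp)) {
        exit_status = -1;
    }
    return exit_status;
}

/* Copy the value of the first line that starts with token into out.
 * Returns 0 if the token was found, -1 otherwise. */
int demo_metadata_read(const char *path, const char *token, char *out, size_t out_len)
{
    int exit_status = -1;
    char line[512];
    size_t tok_len = strlen(token);
    FILE *fp = NULL;

    if (0 == out_len) {
        goto cleanup;
    }
    fp = fopen(path, "r");            /* read mode: required for fgets to work */
    if (NULL == fp) {
        goto cleanup;
    }
    while (NULL != fgets(line, sizeof(line), fp)) {
        if (0 == strncmp(line, token, tok_len)) {
            line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */
            snprintf(out, out_len, "%s", line + tok_len);
            exit_status = 0;
            break;
        }
    }

cleanup:
    /* Single exit path: close the handle only if it was actually opened. */
    if (NULL != fp) {
        fclose(fp);
    }
    return exit_status;
}
```

A caller would write the component name once at checkpoint time and look it up by the same token at restart. Guarding the fclose in the cleanup path matters because the label is also reached when fopen itself fails.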
@@ -1,5 +1,5 @@
|
||||
.\"
|
||||
.\" Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
|
||||
.\" Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
.\" University Research and Technology
|
||||
.\" Corporation. All rights reserved.
|
||||
.\" Copyright (c) 2009 Sun Microsystems, Inc. All rights reserved.
|
||||
@@ -89,10 +89,6 @@ The following MCA parameters apply to all components:
|
||||
crs_base_verbose
|
||||
Set the verbosity level for all components. Default is 0, or silent except on error.
|
||||
.
|
||||
.TP
|
||||
crs_base_snapshot_dir
|
||||
The directory to store the checkpoint snapshots. Default is \fB/tmp\fP.
|
||||
.
|
||||
.\" Self Component
|
||||
.\" ******************
|
||||
.SS self CRS Component
|
||||
|
@@ -285,17 +285,11 @@ int opal_crs_self_checkpoint(pid_t pid,
|
||||
/*
|
||||
* Setup for snapshot directory creation
|
||||
*/
|
||||
if(NULL != snapshot->super.reference_name)
|
||||
free(snapshot->super.reference_name);
|
||||
snapshot->super.reference_name = strdup(base_snapshot->reference_name);
|
||||
|
||||
if(NULL != snapshot->super.local_location)
|
||||
free(snapshot->super.local_location);
|
||||
snapshot->super.local_location = strdup(base_snapshot->local_location);
|
||||
|
||||
if(NULL != snapshot->super.remote_location)
|
||||
free(snapshot->super.remote_location);
|
||||
snapshot->super.remote_location = strdup(base_snapshot->remote_location);
|
||||
snapshot->super = *base_snapshot;
|
||||
#if 0
|
||||
snapshot->super.snapshot_directory = strdup(base_snapshot->snapshot_directory);
|
||||
snapshot->super.metadata_filename = strdup(base_snapshot->metadata_filename);
|
||||
#endif
|
||||
|
||||
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
||||
"crs:self: checkpoint(%d, ---)", pid);
|
||||
@@ -310,13 +304,16 @@ int opal_crs_self_checkpoint(pid_t pid,
|
||||
* Update the snapshot metadata
|
||||
*/
|
||||
snapshot->super.component_name = strdup(mca_crs_self_component.super.base_version.mca_component_name);
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_metadata_write_token(NULL, CRS_METADATA_COMP, snapshot->super.component_name) ) ) {
|
||||
opal_output(mca_crs_self_component.super.output_handle,
|
||||
"crs:self: checkpoint(): Error: Unable to write component name to the directory for (%s).",
|
||||
snapshot->super.reference_name);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
if( NULL == snapshot->super.metadata ) {
|
||||
if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "a")) ) {
|
||||
opal_output(mca_crs_self_component.super.output_handle,
|
||||
"crs:self: checkpoint(): Error: Unable to open the file (%s)",
|
||||
snapshot->super.metadata_filename);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
}
|
||||
fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_COMP, snapshot->super.component_name);
|
||||
|
||||
/*
|
||||
* Call the user callback function
|
||||
@@ -350,7 +347,7 @@ int opal_crs_self_checkpoint(pid_t pid,
|
||||
*state = OPAL_CRS_ERROR;
|
||||
opal_output(mca_crs_self_component.super.output_handle,
|
||||
"crs:self: checkpoint(): Error: Unable to update metadata for snapshot (%s).",
|
||||
snapshot->super.reference_name);
|
||||
snapshot->super.metadata_filename);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
@@ -392,7 +389,7 @@ int opal_crs_self_restart(opal_crs_base_snapshot_t *base_snapshot, bool spawn_ch
|
||||
snapshot->super = *base_snapshot;
|
||||
|
||||
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
||||
"crs:self: restart(%s, %d)", snapshot->super.reference_name, spawn_child);
|
||||
"crs:self: restart(%d)", spawn_child);
|
||||
|
||||
/*
|
||||
* If we need to reconstruct the snapshot
|
||||
@@ -675,16 +672,25 @@ static int self_cold_start(opal_crs_self_snapshot_t *snapshot) {
|
||||
int prev_pid;
|
||||
|
||||
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
||||
"crs:self: cold_start(%s)", snapshot->super.reference_name);
|
||||
"crs:self: cold_start()");
|
||||
|
||||
/*
|
||||
* Find the snapshot directory, read the metadata file
|
||||
*/
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.local_location,
|
||||
if( NULL == snapshot->super.metadata ) {
|
||||
if (NULL == (snapshot->super.metadata = fopen(snapshot->super.metadata_filename, "r")) ) {
|
||||
opal_output(mca_crs_self_component.super.output_handle,
|
||||
"crs:self: checkpoint(): Error: Unable to open the file (%s)",
|
||||
snapshot->super.metadata_filename);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
}
|
||||
if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(snapshot->super.metadata,
|
||||
&component_name, &prev_pid) ) ) {
|
||||
opal_output(mca_crs_self_component.super.output_handle,
|
||||
"crs:self: self_cold_start: Error: Failed to extract the metadata from the local snapshot (%s). Returned %d.",
|
||||
snapshot->super.local_location, ret);
|
||||
snapshot->super.metadata_filename, ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
@@ -705,11 +711,11 @@ static int self_cold_start(opal_crs_self_snapshot_t *snapshot) {
|
||||
* Restart command
|
||||
* JJH: Command lines limited to 256 chars.
|
||||
*/
|
||||
opal_crs_base_metadata_read_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, &tmp_argv);
|
||||
opal_crs_base_metadata_read_token(snapshot->super.metadata, CRS_METADATA_CONTEXT, &tmp_argv);
|
||||
if( NULL == tmp_argv ) {
|
||||
opal_output(mca_crs_self_component.super.output_handle,
|
||||
"crs:self: self_cold_start: Error: Failed to read the %s token from the local checkpoint in %s",
|
||||
CRS_METADATA_CONTEXT, snapshot->super.local_location);
|
||||
CRS_METADATA_CONTEXT, snapshot->super.snapshot_directory);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
@@ -742,13 +748,13 @@ static int self_update_snapshot_metadata(opal_crs_self_snapshot_t *snapshot) {
|
||||
|
||||
opal_output_verbose(10, mca_crs_self_component.super.output_handle,
|
||||
"crs:self: update_snapshot_metadata(%s)",
|
||||
snapshot->super.reference_name);
|
||||
snapshot->super.metadata_filename);
|
||||
|
||||
/*
|
||||
* Append to the metadata file the command line to restart with
|
||||
* - How user wants us to restart
|
||||
*/
|
||||
opal_crs_base_metadata_write_token(snapshot->super.local_location, CRS_METADATA_CONTEXT, snapshot->cmd_line);
|
||||
fprintf(snapshot->super.metadata, "%s%s\n", CRS_METADATA_CONTEXT, snapshot->cmd_line);
|
||||
|
||||
cleanup:
|
||||
return exit_status;
|
||||
|
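The BLCR thread callback in this change brackets cr_checkpoint() with trigger_user_inc_callback(event, state) calls, passing a PREPARE state before the checkpoint and a CONTINUE or RESTART state after it. One plausible way to back such a hook is a fixed table with one callback slot per event. The sketch below is hypothetical throughout: the demo_* names, enums, and signatures are illustrative assumptions and do not reproduce the actual opal_cr or OMPI_CR_INC_register_callback interfaces.

```c
#include <stddef.h>

/* Hypothetical event and state enums, loosely modeled on the OMPI_CR_INC_* names. */
typedef enum {
    DEMO_INC_PRE_CKPT = 0,
    DEMO_INC_POST_CKPT,
    DEMO_INC_MAX
} demo_inc_event_t;

typedef enum {
    DEMO_STATE_PREPARE = 0,
    DEMO_STATE_CONTINUE,
    DEMO_STATE_RESTART
} demo_inc_state_t;

typedef int (*demo_inc_callback_fn_t)(demo_inc_event_t event, demo_inc_state_t state);

/* One slot per event; NULL means nothing is registered. */
static demo_inc_callback_fn_t demo_callbacks[DEMO_INC_MAX] = { NULL };

/* Install fn for event and hand back the previous callback through prev
 * (prev may be NULL). Returns 0 on success, -1 if event is out of range. */
int demo_register_callback(demo_inc_event_t event,
                           demo_inc_callback_fn_t fn,
                           demo_inc_callback_fn_t *prev)
{
    if ((int)event < 0 || (int)event >= (int)DEMO_INC_MAX) {
        return -1;
    }
    if (NULL != prev) {
        *prev = demo_callbacks[event];
    }
    demo_callbacks[event] = fn;
    return 0;
}

/* Invoke the callback for event, if any. An empty slot counts as success. */
int demo_trigger_callback(demo_inc_event_t event, demo_inc_state_t state)
{
    if ((int)event < 0 || (int)event >= (int)DEMO_INC_MAX) {
        return -1;
    }
    if (NULL == demo_callbacks[event]) {
        return 0;
    }
    return demo_callbacks[event](event, state);
}

/* Example callback that records the last state it was called with. */
static int demo_last_state = -1;

int demo_record_state(demo_inc_event_t event, demo_inc_state_t state)
{
    (void)event;
    demo_last_state = (int)state;
    return 0;
}

int demo_get_last_state(void)
{
    return demo_last_state;
}
```

Returning the previous callback lets a caller chain to it or restore it later, and treating an empty slot as success keeps the checkpoint path unchanged for applications that never register a hook.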
@@ -74,9 +74,21 @@
|
||||
/******************
|
||||
* Global Var Decls
|
||||
******************/
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
static opal_thread_t **opal_cr_debug_free_threads = NULL;
|
||||
static int opal_cr_debug_num_free_threads = 0;
|
||||
static int opal_cr_debug_threads_already_waiting = false;
|
||||
|
||||
int MPIR_debug_with_checkpoint = 0;
|
||||
static volatile int MPIR_checkpoint_debug_gate = 0;
|
||||
|
||||
int opal_cr_debug_signal = 0;
|
||||
#endif
|
||||
|
||||
bool opal_cr_stall_check = false;
|
||||
bool opal_cr_currently_stalled = false;
|
||||
int opal_cr_output;
|
||||
int opal_cr_initalized = 0;
|
||||
|
||||
static double opal_cr_get_time(void);
|
||||
static void display_indv_timer_core(double diff, char *str);
|
||||
@@ -89,10 +101,11 @@ int opal_cr_timing_target_rank = 0;
|
||||
/******************
|
||||
* Local Functions & Var Decls
|
||||
******************/
|
||||
static int extract_env_vars(int prev_pid);
|
||||
static int extract_env_vars(int prev_pid, char * file_name);
|
||||
|
||||
static void opal_cr_sigpipe_debug_signal_handler (int signo);
|
||||
|
||||
static opal_cr_user_inc_callback_fn_t cur_user_coord_callback[OMPI_CR_INC_MAX] = {NULL};
|
||||
static opal_cr_coord_callback_fn_t cur_coord_callback = NULL;
|
||||
static opal_cr_notify_callback_fn_t cur_notify_callback = NULL;
|
||||
|
||||
@@ -179,13 +192,11 @@ int opal_cr_set_enabled(bool en)
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
int opal_cr_initalized = 0;
|
||||
|
||||
int opal_cr_init(void )
|
||||
{
|
||||
int ret, exit_status = OPAL_SUCCESS;
|
||||
opal_cr_coord_callback_fn_t prev_coord_func;
|
||||
int val;
|
||||
int val, t;
|
||||
|
||||
if( ++opal_cr_initalized != 1 ) {
|
||||
if( opal_cr_initalized < 1 ) {
|
||||
@@ -265,9 +276,9 @@ int opal_cr_init(void )
|
||||
opal_cr_thread_sleep_check = val;
|
||||
|
||||
mca_base_param_reg_int_name("opal_cr", "thread_sleep_wait",
|
||||
"Time to sleep waiting for process to exit MPI library (Default: 0)",
|
||||
"Time to sleep waiting for process to exit MPI library (Default: 1000)",
|
||||
false, false,
|
||||
0, &val);
|
||||
1000, &val);
|
||||
opal_cr_thread_sleep_wait = val;
|
||||
|
||||
opal_output_verbose(10, opal_cr_output,
|
||||
@@ -285,6 +296,19 @@ int opal_cr_init(void )
|
||||
opal_output_verbose(10, opal_cr_output,
|
||||
"opal_cr: init: Is a tool program: %d",
|
||||
val);
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
mca_base_param_reg_int_name("opal_cr", "enable_crdebug",
|
||||
"Enable checkpoint/restart debugging",
|
||||
false, false,
|
||||
0,
|
||||
&val);
|
||||
MPIR_debug_with_checkpoint = OPAL_INT_TO_BOOL(val);
|
||||
|
||||
opal_output_verbose(10, opal_cr_output,
|
||||
"opal_cr: init: C/R Debugging Enabled [%s]\n",
|
||||
(MPIR_debug_with_checkpoint ? "True": "False"));
|
||||
#endif
|
||||
|
||||
#ifndef __WINDOWS__
|
||||
mca_base_param_reg_int_name("opal_cr", "signal",
|
||||
"Checkpoint/Restart signal used to initialize an OPAL Only checkpoint of a program",
|
||||
@@ -327,10 +351,36 @@ int opal_cr_init(void )
|
||||
opal_cr_is_tool = true; /* no support for CR on Windows yet */
|
||||
#endif /* __WINDOWS__ */
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
opal_cr_debug_num_free_threads = 3;
|
||||
opal_cr_debug_free_threads = (opal_thread_t **)malloc(sizeof(opal_thread_t *) * opal_cr_debug_num_free_threads );
|
||||
for(t = 0; t < opal_cr_debug_num_free_threads; ++t ) {
|
||||
opal_cr_debug_free_threads[t] = NULL;
|
||||
}
|
||||
|
||||
mca_base_param_reg_int_name("opal_cr", "crdebug_signal",
|
||||
"Checkpoint/Restart signal used to hold threads when debugging",
|
||||
false, false,
|
||||
SIGTSTP,
|
||||
&opal_cr_debug_signal);
|
||||
|
||||
opal_output_verbose(10, opal_cr_output,
|
||||
"opal_cr: init: Checkpoint Signal (Debug): %d",
|
||||
opal_cr_debug_signal);
|
||||
if( SIG_ERR == signal(opal_cr_debug_signal, MPIR_checkpoint_debugger_signal_handler) ) {
|
||||
opal_output(opal_cr_output,
|
||||
"opal_cr: init: Failed to register C/R debug signal (%d)",
|
||||
opal_cr_debug_signal);
|
||||
}
|
||||
#else
|
||||
/* Silence a compiler warning */
|
||||
t = 0;
|
||||
#endif
|
||||
|
||||
mca_base_param_reg_string_name("opal_cr", "tmp_dir",
|
||||
"Temporary directory to place rendezvous files for a checkpoint",
|
||||
false, false,
|
||||
"/tmp",
|
||||
opal_tmp_directory(),
|
||||
&opal_cr_pipe_dir);
|
||||
|
||||
opal_output_verbose(10, opal_cr_output,
|
||||
@@ -436,6 +486,14 @@ int opal_cr_finalize(void)
|
||||
opal_cr_checkpoint_request = OPAL_CR_STATUS_TERM;
|
||||
}
|
||||
|
||||
#if OPAL_ENABLE_CRDEBUG == 1
|
||||
if( NULL != opal_cr_debug_free_threads ) {
|
||||
free( opal_cr_debug_free_threads );
|
||||
opal_cr_debug_free_threads = NULL;
|
||||
}
|
||||
opal_cr_debug_num_free_threads = 0;
|
||||
#endif
|
||||
|
||||
if (NULL != opal_cr_pipe_dir) {
|
||||
free(opal_cr_pipe_dir);
|
||||
opal_cr_pipe_dir = NULL;
|
||||
@@ -523,6 +581,14 @@ int opal_cr_inc_core_prep(void)
|
||||
{
|
||||
int ret;
|
||||
|
||||
/*
|
||||
* Call User Level INC
|
||||
*/
|
||||
if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_PRE_CRS_PRE_MPI,
|
||||
OMPI_CR_INC_STATE_PREPARE)) ) {
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* Use the registered coordination routine
|
||||
*/
|
||||
@@ -535,6 +601,14 @@ int opal_cr_inc_core_prep(void)
|
||||
return ret;
|
||||
}
|
||||
|
||||
/*
|
||||
* Call User Level INC
|
||||
*/
|
||||
if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_PRE_CRS_POST_MPI,
|
||||
OMPI_CR_INC_STATE_PREPARE)) ) {
|
||||
return ret;
|
||||
}
|
||||
|
||||
core_prev_pid = getpid();
|
||||
|
||||
return OPAL_SUCCESS;
|
||||
@@ -575,7 +649,7 @@ int opal_cr_inc_core_ckpt(pid_t pid,
     * If restarting read environment stuff that opal-restart left us.
     */
    if(*state == OPAL_CRS_RESTART) {
        extract_env_vars(core_prev_pid);
        opal_cr_refresh_environ(core_prev_pid);
        opal_cr_checkpointing_state = OPAL_CR_STATUS_RESTART_PRE;
    }
@@ -585,6 +659,7 @@ int opal_cr_inc_core_ckpt(pid_t pid,
int opal_cr_inc_core_recover(int state)
{
    int ret;
    opal_cr_user_inc_callback_state_t cb_state;

    if( opal_cr_checkpointing_state != OPAL_CR_STATUS_TERM &&
        opal_cr_checkpointing_state != OPAL_CR_STATUS_CONTINUE &&
@@ -599,11 +674,29 @@ int opal_cr_inc_core_recover(int state)
         * If restarting read environment stuff that opal-restart left us.
         */
        else if(state == OPAL_CRS_RESTART) {
            extract_env_vars(core_prev_pid);
            opal_cr_refresh_environ(core_prev_pid);
            opal_cr_checkpointing_state = OPAL_CR_STATUS_RESTART_PRE;
        }
    }

    /*
     * Call User Level INC
     */
    if( OPAL_CRS_CONTINUE == state ) {
        cb_state = OMPI_CR_INC_STATE_CONTINUE;
    }
    else if( OPAL_CRS_RESTART == state ) {
        cb_state = OMPI_CR_INC_STATE_RESTART;
    }
    else {
        cb_state = OMPI_CR_INC_STATE_ERROR;
    }

    if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_POST_CRS_PRE_MPI,
                                                        cb_state)) ) {
        return ret;
    }

    /*
     * Use the registered coordination routine
     */
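The hunk above translates the CRS state passed to `opal_cr_inc_core_recover()` into the state handed to the user-level INC callback, with anything other than continue or restart treated as an error. The same mapping can be sketched as a small function. The `CRS_*` values below are assumptions made for this sketch, not the real `OPAL_CRS_*` constants; the `INC_STATE_*` values mirror the `OMPI_CR_INC_STATE_*` enum added elsewhere in this patch.

```c
/* Assumed stand-ins for OPAL_CRS_CONTINUE / OPAL_CRS_RESTART; the numeric
 * values are placeholders chosen for this sketch. */
enum crs_state { CRS_CONTINUE = 1, CRS_RESTART = 2 };

/* Mirrors OMPI_CR_INC_STATE_* from the header hunk in this patch. */
enum inc_state {
    INC_STATE_PREPARE  = 0,
    INC_STATE_CONTINUE = 1,
    INC_STATE_RESTART  = 2,
    INC_STATE_ERROR    = 3
};

/* Map a CRS state to the callback state; unknown states become ERROR. */
static enum inc_state inc_state_from_crs(int state)
{
    switch (state) {
    case CRS_CONTINUE: return INC_STATE_CONTINUE;
    case CRS_RESTART:  return INC_STATE_RESTART;
    default:           return INC_STATE_ERROR;
    }
}
```

Keeping the mapping in one place means both post-CRS callback invocations (pre-MPI and post-MPI) see the same state, which is what the patch achieves by computing `cb_state` once.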
@@ -616,6 +709,15 @@ int opal_cr_inc_core_recover(int state)
        return ret;
    }

    if(OPAL_SUCCESS != (ret = trigger_user_inc_callback(OMPI_CR_INC_POST_CRS_POST_MPI,
                                                        cb_state)) ) {
        return ret;
    }

#if OPAL_ENABLE_CRDEBUG == 1
    opal_cr_debug_clear_current_ckpt_thread();
#endif

    return OPAL_SUCCESS;
}
@@ -717,6 +819,39 @@ int opal_cr_reg_notify_callback(opal_cr_notify_callback_fn_t new_func,
    return OPAL_SUCCESS;
}

int opal_cr_user_inc_register_callback(opal_cr_user_inc_callback_event_t event,
                                       opal_cr_user_inc_callback_fn_t function,
                                       opal_cr_user_inc_callback_fn_t *prev_function)
{
    if( event < 0 || event >= OMPI_CR_INC_MAX ) {
        return OPAL_ERROR;
    }

    if( NULL != cur_user_coord_callback[event] ) {
        *prev_function = cur_user_coord_callback[event];
    } else {
        *prev_function = NULL;
    }

    cur_user_coord_callback[event] = function;

    return OPAL_SUCCESS;
}

int trigger_user_inc_callback(opal_cr_user_inc_callback_event_t event,
                              opal_cr_user_inc_callback_state_t state)
{
    if( NULL == cur_user_coord_callback[event] ) {
        return OPAL_SUCCESS;
    }

    if( event < 0 || event >= OMPI_CR_INC_MAX ) {
        return OPAL_ERROR;
    }

    return ((cur_user_coord_callback[event])(event, state));
}

int opal_cr_reg_coord_callback(opal_cr_coord_callback_fn_t new_func,
                               opal_cr_coord_callback_fn_t *prev_func)
{
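The two functions added in the hunk above form a small callback table: registration stores a function per event and returns the previous one, and triggering calls the stored function if any. The sketch below shows the same idea in isolation. All names (`cb_register`, `cb_trigger`, `sample_cb`, `CB_OK`, `CB_ERR`) are hypothetical. One deliberate difference from the patch: the sketch validates the event index before reading the table in the trigger path, whereas the patch's `trigger_user_inc_callback()` reads the table first.

```c
#include <stddef.h>

enum { EV_MAX = 6 };              /* mirrors OMPI_CR_INC_MAX in the patch */
enum { CB_OK = 0, CB_ERR = -1 };  /* hypothetical stand-ins for OPAL_SUCCESS / OPAL_ERROR */

typedef int (*inc_cb_fn)(int event, int state);

static inc_cb_fn cb_table[EV_MAX];

/* Install fn for event and hand back whatever was installed before. */
static int cb_register(int event, inc_cb_fn fn, inc_cb_fn *prev)
{
    if (event < 0 || event >= EV_MAX) {
        return CB_ERR;
    }
    if (NULL != prev) {
        *prev = cb_table[event];
    }
    cb_table[event] = fn;
    return CB_OK;
}

/* Validate the index first, then call the slot if one is installed. */
static int cb_trigger(int event, int state)
{
    if (event < 0 || event >= EV_MAX) {
        return CB_ERR;
    }
    if (NULL == cb_table[event]) {
        return CB_OK;
    }
    return cb_table[event](event, state);
}

/* Example callback: encodes its arguments so a caller can see it ran. */
static int sample_cb(int event, int state)
{
    return 100 + event * 10 + state;
}
```

Returning the previous callback lets a caller chain or restore handlers, which appears to be the purpose of the `prev_function` out-parameter in the patch.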
@@ -738,14 +873,61 @@ int opal_cr_reg_coord_callback(opal_cr_coord_callback_fn_t new_func,
    return OPAL_SUCCESS;
}

int opal_cr_refresh_environ(int prev_pid) {
    int val;
    char *file_name = NULL;
    struct stat file_status;

    if( 0 >= prev_pid ) {
        prev_pid = getpid();
    }

    /*
     * Make sure the file exists. If it doesn't then this means 2 things:
     * 1) We have already executed this function, and
     * 2) The file has been deleted on the previous round.
     */
    asprintf(&file_name, "%s/%s-%d", opal_tmp_directory(), OPAL_CR_BASE_ENV_NAME, prev_pid);
    if(0 != stat(file_name, &file_status) ){
        return OPAL_SUCCESS;
    }

#if OPAL_ENABLE_CRDEBUG == 1
    opal_unsetenv(mca_base_param_env_var("opal_cr_enable_crdebug"), &environ);
#endif

    extract_env_vars(prev_pid, file_name);

#if OPAL_ENABLE_CRDEBUG == 1
    mca_base_param_reg_int_name("opal_cr", "enable_crdebug",
                                "Enable checkpoint/restart debugging",
                                false, false,
                                0,
                                &val);
    MPIR_debug_with_checkpoint = OPAL_INT_TO_BOOL(val);

    opal_output_verbose(10, opal_cr_output,
                        "opal_cr: init: C/R Debugging Enabled [%s] (refresh)\n",
                        (MPIR_debug_with_checkpoint ? "True": "False"));
#else
    val = 0; /* Silence Compiler warning */
#endif

    if( NULL != file_name ){
        free(file_name);
        file_name = NULL;
    }

    return OPAL_SUCCESS;
}

/*
 * Extract environment variables from a saved file
 * and place them in the environment.
 */
static int extract_env_vars(int prev_pid)
static int extract_env_vars(int prev_pid, char * file_name)
{
    int exit_status = OPAL_SUCCESS;
    char *file_name = NULL;
    FILE *env_data = NULL;
    int len = OPAL_PATH_MAX;
    char * tmp_str = NULL;
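`opal_cr_refresh_environ()` above builds the path of the saved-environment file and uses `stat()` to decide whether there is anything to read. The sketch below isolates that existence check. The helper names and the `"opal_cr_env"` base name used in the example are hypothetical (the real base name comes from `OPAL_CR_BASE_ENV_NAME`, whose value is not shown in this diff). The sketch uses `snprintf` plus `malloc` instead of `asprintf`, and frees the path on every return path; the early `return` in the hunk does not appear to free `file_name`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Build "<dir>/<base>-<pid>" in a freshly allocated string; NULL on failure. */
static char *env_file_path(const char *dir, const char *base, int pid)
{
    int need = snprintf(NULL, 0, "%s/%s-%d", dir, base, pid);
    char *path;

    if (need < 0) {
        return NULL;
    }
    path = malloc((size_t)need + 1);
    if (NULL != path) {
        snprintf(path, (size_t)need + 1, "%s/%s-%d", dir, base, pid);
    }
    return path;
}

/* Returns 1 if the file exists, 0 if it does not, -1 on allocation failure.
 * The path is freed before every return. */
static int env_file_exists(const char *dir, const char *base, int pid)
{
    struct stat st;
    char *path = env_file_path(dir, base, pid);
    int found;

    if (NULL == path) {
        return -1;
    }
    found = (0 == stat(path, &st)) ? 1 : 0;
    free(path);
    return found;
}
```

Deriving the directory from one shared helper on both the writer and reader side is what the patch does by switching both to `opal_tmp_directory()` instead of a hardcoded `/tmp`.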
@@ -758,12 +940,6 @@ static int extract_env_vars(int prev_pid)
        goto cleanup;
    }

    /*
     * JJH: Hardcode /tmp here, really only need an agreed upon file to
     * transfer the environment variables.
     */
    asprintf(&file_name, "/tmp/%s-%d", OPAL_CR_BASE_ENV_NAME, prev_pid);

    if (NULL == (env_data = fopen(file_name, "r")) ) {
        exit_status = OPAL_ERROR;
        goto cleanup;
@@ -805,17 +981,12 @@ static int extract_env_vars(int prev_pid)
        tmp_str = NULL;
    }


 cleanup:
    if( NULL != env_data ) {
        fclose(env_data);
    }
    unlink(file_name);

    if( NULL != file_name ){
        free(file_name);
    }

    if( NULL != tmp_str ){
        free(tmp_str);
    }
@@ -871,6 +1042,10 @@ static void* opal_cr_thread_fn(opal_object_t *obj)
        }
    }

#if OPAL_ENABLE_CRDEBUG == 1
    opal_cr_debug_free_threads[1] = opal_thread_get_self();
#endif

    /*
     * Wait to become active
     */
@@ -1106,3 +1281,129 @@ void opal_cr_display_all_timers(void)

    opal_output(0, "OPAL CR Timing: ******************** Summary End\n");
}

#if OPAL_ENABLE_CRDEBUG == 1
int opal_cr_debug_set_current_ckpt_thread_self(void)
{
    int t;

    if( NULL == opal_cr_debug_free_threads ) {
        opal_cr_debug_num_free_threads = 3;
        opal_cr_debug_free_threads = (opal_thread_t **)malloc(sizeof(opal_thread_t *) * opal_cr_debug_num_free_threads );
        for(t = 0; t < opal_cr_debug_num_free_threads; ++t ) {
            opal_cr_debug_free_threads[t] = NULL;
        }
    }

    opal_cr_debug_free_threads[0] = opal_thread_get_self();

    return OPAL_SUCCESS;
}

int opal_cr_debug_clear_current_ckpt_thread(void)
{
    opal_cr_debug_free_threads[0] = NULL;

    return OPAL_SUCCESS;
}

int MPIR_checkpoint_debugger_detach(void) {
    /* This function is meant to be a noop function for checkpoint/restart
     * enabled debugging functionality */
#if 0
    /* Once the debugger can successfully force threads into the function below,
     * then we can uncomment this line */
    if( MPIR_debug_with_checkpoint ) {
        opal_cr_debug_threads_already_waiting = true;
    }
#endif
    return OPAL_SUCCESS;
}

void MPIR_checkpoint_debugger_signal_handler(int signo)
{
    opal_output_verbose(1, opal_cr_output,
                        "crs: MPIR_checkpoint_debugger_signal_handler(): Enter Debug signal handler...");

    MPIR_checkpoint_debugger_waitpoint();

    opal_output_verbose(1, opal_cr_output,
                        "crs: MPIR_checkpoint_debugger_signal_handler(): Leave Debug signal handler...");
}

void *MPIR_checkpoint_debugger_waitpoint(void)
{
    int t;
    opal_thread_t *thr = NULL;

    thr = opal_thread_get_self();

    /*
     * Sanity check, if the debugger is not going to attach, then do not wait
     * Make sure to open the debug gate, so that threads can get out
     */
    if( !MPIR_debug_with_checkpoint ) {
        opal_output_verbose(1, opal_cr_output,
                            "crs: MPIR_checkpoint_debugger_waitpoint(): Debugger is not attaching... (%d)",
                            (int)thr->t_handle);
        MPIR_checkpoint_debug_gate = 1;
        return NULL;
    }
    else {
        opal_output_verbose(1, opal_cr_output,
                            "crs: MPIR_checkpoint_debugger_waitpoint(): Waiting for the Debugger to attach... (%d)",
                            (int)thr->t_handle);
        MPIR_checkpoint_debug_gate = 0;
    }

    /*
     * Let special threads escape without waiting, they will wait later
     */
    for(t = 0; t < opal_cr_debug_num_free_threads; ++t) {
        if( opal_cr_debug_free_threads[t] != NULL &&
            opal_thread_self_compare(opal_cr_debug_free_threads[t]) ) {
            opal_output_verbose(1, opal_cr_output,
                                "crs: MPIR_checkpoint_debugger_waitpoint(): Checkpointing thread does not wait here... (%d)",
                                (int)thr->t_handle);
            return NULL;
        }
    }

    /*
     * Force all other threads into the waiting function,
     * unless they are already in there, then just return so we do not nest
     * calls into this wait function and potentially confuse the debugger.
     */
    if( opal_cr_debug_threads_already_waiting ) {
        opal_output_verbose(1, opal_cr_output,
                            "crs: MPIR_checkpoint_debugger_waitpoint(): Threads are already waiting from debugger detach, do not wait here... (%d)",
                            (int)thr->t_handle);
        return NULL;
    } else {
        opal_output_verbose(1, opal_cr_output,
                            "crs: MPIR_checkpoint_debugger_waitpoint(): Wait... (%d)",
                            (int)thr->t_handle);
        return MPIR_checkpoint_debugger_breakpoint();
    }
}

/*
 * A tight loop to wait for debugger to release this process from the
 * breakpoint.
 */
void *MPIR_checkpoint_debugger_breakpoint(void)
{
    /* spin until debugger attaches and releases us */
    while (MPIR_checkpoint_debug_gate == 0) {
#if defined(__WINDOWS__)
        Sleep(100); /* milliseconds */
#elif defined(HAVE_USLEEP)
        usleep(100000); /* microseconds */
#else
        sleep(1); /* seconds */
#endif
    }
    opal_cr_debug_threads_already_waiting = false;
    return NULL;
}
#endif

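`MPIR_checkpoint_debugger_breakpoint()` above holds a thread by polling a gate variable and sleeping between polls until a debugger sets the gate to a non-zero value. The sketch below shows that polling pattern on its own. The names `debug_gate` and `wait_for_gate` are hypothetical, and the `max_polls` bound is an addition for this sketch so that it always terminates; the loop in the patch has no such bound. It uses POSIX `nanosleep()` rather than the `usleep()`/`sleep()` fallbacks in the patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#endif
#include <time.h>

/* Gate the poller watches: 0 = hold, non-zero = released. In the patch the
 * debugger is expected to flip MPIR_checkpoint_debug_gate from outside. */
static volatile int debug_gate = 0;

/* Poll the gate every interval_ms milliseconds, giving up after max_polls
 * sleeps. Returns the number of sleeps performed. */
static int wait_for_gate(volatile int *gate, long interval_ms, int max_polls)
{
    struct timespec ts;
    int polls = 0;

    ts.tv_sec  = interval_ms / 1000;
    ts.tv_nsec = (interval_ms % 1000) * 1000000L;

    while (0 == *gate && polls < max_polls) {
        nanosleep(&ts, NULL);
        ++polls;
    }
    return polls;
}
```

The `volatile` qualifier matters here because the gate is changed by something the compiler cannot see, in the patch's case a debugger writing to process memory.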
@@ -91,6 +91,44 @@ typedef enum opal_cr_ckpt_cmd_state_t opal_cr_ckpt_cmd_state_t;
    /* The current state of a checkpoint operation */
    OPAL_DECLSPEC extern int opal_cr_checkpointing_state;

#if OPAL_ENABLE_CRDEBUG == 1
    /* Whether or not C/R Debugging is enabled for this process */
    OPAL_DECLSPEC extern int MPIR_debug_with_checkpoint;

    /*
     * Set/clear the current thread id for the checkpointing thread
     */
    OPAL_DECLSPEC int opal_cr_debug_set_current_ckpt_thread_self(void);
    OPAL_DECLSPEC int opal_cr_debug_clear_current_ckpt_thread(void);

    /*
     * This MPI Debugger function needs to be accessed here and have a specific
     * name. Thus we are breaking the traditional naming conventions to provide this functionality.
     */
    OPAL_DECLSPEC int MPIR_checkpoint_debugger_detach(void);

    /**
     * A tight loop to wait for debugger to release this process from the
     * breakpoint.
     */
    OPAL_DECLSPEC void *MPIR_checkpoint_debugger_breakpoint(void);

    /**
     * A function for the debugger or CRS to force all threads into
     */
    OPAL_DECLSPEC void *MPIR_checkpoint_debugger_waitpoint(void);

    /**
     * A signal handler to force all threads to wait when debugger detaches
     */
    OPAL_DECLSPEC void MPIR_checkpoint_debugger_signal_handler(int signo);
#endif

    /*
     * Refresh environment variables after a restart
     */
    OPAL_DECLSPEC int opal_cr_refresh_environ(int prev_pid);

    /*
     * If this is an application that doesn't want to have
     * a notification callback installed, set this to false.
@@ -253,6 +291,42 @@ typedef enum opal_cr_ckpt_cmd_state_t opal_cr_ckpt_cmd_state_t;
                                              int *state);
    OPAL_DECLSPEC int opal_cr_inc_core_recover(int state);


    /*******************************
     * User Coordination Routines
     *******************************/
    typedef enum {
        OMPI_CR_INC_PRE_CRS_PRE_MPI = 0,
        OMPI_CR_INC_PRE_CRS_POST_MPI = 1,
        OMPI_CR_INC_CRS_PRE_CKPT = 2,
        OMPI_CR_INC_CRS_POST_CKPT = 3,
        OMPI_CR_INC_POST_CRS_PRE_MPI = 4,
        OMPI_CR_INC_POST_CRS_POST_MPI = 5,
        OMPI_CR_INC_MAX = 6
    } opal_cr_user_inc_callback_event_t;

    typedef enum {
        OMPI_CR_INC_STATE_PREPARE = 0,
        OMPI_CR_INC_STATE_CONTINUE = 1,
        OMPI_CR_INC_STATE_RESTART = 2,
        OMPI_CR_INC_STATE_ERROR = 3
    } opal_cr_user_inc_callback_state_t;

    /**
     * User coordination callback routine
     */
    typedef int (*opal_cr_user_inc_callback_fn_t)(opal_cr_user_inc_callback_event_t event,
                                                  opal_cr_user_inc_callback_state_t state);

    OPAL_DECLSPEC int opal_cr_user_inc_register_callback
                      (opal_cr_user_inc_callback_event_t event,
                       opal_cr_user_inc_callback_fn_t function,
                       opal_cr_user_inc_callback_fn_t *prev_function);

    OPAL_DECLSPEC int trigger_user_inc_callback(opal_cr_user_inc_callback_event_t event,
                                                opal_cr_user_inc_callback_state_t state);


    /*******************************
     * Coordination Routines
     *******************************/

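The header hunk above defines the event and state enums and the `opal_cr_user_inc_callback_fn_t` signature that user callbacks must match. The sketch below copies those two enums and the typedef so it compiles on its own, then defines one callback with the required shape. The callback `count_inc_callback`, the counters, and the `installed` pointer are hypothetical; returning `0` for success is an assumption standing in for `OPAL_SUCCESS`.

```c
/* Values copied from the enums in the header hunk above. */
typedef enum {
    OMPI_CR_INC_PRE_CRS_PRE_MPI   = 0,
    OMPI_CR_INC_PRE_CRS_POST_MPI  = 1,
    OMPI_CR_INC_CRS_PRE_CKPT      = 2,
    OMPI_CR_INC_CRS_POST_CKPT     = 3,
    OMPI_CR_INC_POST_CRS_PRE_MPI  = 4,
    OMPI_CR_INC_POST_CRS_POST_MPI = 5,
    OMPI_CR_INC_MAX               = 6
} opal_cr_user_inc_callback_event_t;

typedef enum {
    OMPI_CR_INC_STATE_PREPARE  = 0,
    OMPI_CR_INC_STATE_CONTINUE = 1,
    OMPI_CR_INC_STATE_RESTART  = 2,
    OMPI_CR_INC_STATE_ERROR    = 3
} opal_cr_user_inc_callback_state_t;

typedef int (*opal_cr_user_inc_callback_fn_t)(opal_cr_user_inc_callback_event_t event,
                                              opal_cr_user_inc_callback_state_t state);

/* Hypothetical bookkeeping for the example callback. */
static int calls_per_event[OMPI_CR_INC_MAX];
static opal_cr_user_inc_callback_state_t last_state;

/* Hypothetical user callback: counts invocations per event and remembers
 * the most recent state. Returns 0 on success, -1 for an unknown event. */
static int count_inc_callback(opal_cr_user_inc_callback_event_t event,
                              opal_cr_user_inc_callback_state_t state)
{
    if ((int)event < 0 || event >= OMPI_CR_INC_MAX) {
        return -1;
    }
    calls_per_event[event]++;
    last_state = state;
    return 0;
}

/* The assignment compiles only if the callback matches the typedef. */
static opal_cr_user_inc_callback_fn_t installed = count_inc_callback;
```

In a real build the function pointer would be passed to `opal_cr_user_inc_register_callback()` once per event of interest instead of being stored in a local variable.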
@@ -1,5 +1,5 @@
/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -43,6 +43,9 @@
#include "opal/event/event.h"
#include "opal/runtime/opal_progress.h"
#include "opal/mca/carto/base/base.h"
#if OPAL_ENABLE_FT_CR == 1
#include "opal/mca/compress/base/base.h"
#endif

#include "opal/runtime/opal_cr.h"
#include "opal/mca/crs/base/base.h"
@@ -112,6 +115,10 @@ opal_finalize(void)
    /* close the checkpoint and restart service */
    opal_cr_finalize();

#if OPAL_ENABLE_FT_CR == 1
    opal_compress_base_close();
#endif

    opal_progress_finalize();

    opal_event_fini();

@@ -1,5 +1,5 @@
/*
 * Copyright (c) 2004-2007 The Trustees of Indiana University and Indiana
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
@@ -40,6 +40,9 @@
#include "opal/mca/memchecker/base/base.h"
#include "opal/dss/dss.h"
#include "opal/mca/carto/base/base.h"
#if OPAL_ENABLE_FT_CR == 1
#include "opal/mca/compress/base/base.h"
#endif

#include "opal/runtime/opal_cr.h"
#include "opal/mca/crs/base/base.h"
@@ -425,6 +428,23 @@ opal_init(int* pargc, char*** pargv)
    /* we want to tick the event library whenever possible */
    opal_progress_event_users_increment();

#if OPAL_ENABLE_FT_CR == 1
    /*
     * Initialize the compression framework
     * Note: Currently only used in C/R so it has been marked to only
     *       initialize when C/R is enabled. If other places in the code
     *       wish to use this framework, it is safe to remove the protection.
     */
    if( OPAL_SUCCESS != (ret = opal_compress_base_open()) ) {
        error = "opal_compress_base_open() failed";
        goto return_error;
    }
    if( OPAL_SUCCESS != (ret = opal_compress_base_select()) ) {
        error = "opal_compress_base_select() failed";
        goto return_error;
    }
#endif

    /*
     * Initialize the checkpoint/restart functionality
     * Note: Always do this so we can detect if the user

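The `opal_init()` hunk above follows the file's existing pattern: run one initialization step, and on failure record a message in `error` and jump to a shared exit. The sketch below expresses the same "stop at the first failing step and report which one" idea as a table of steps. This is a table-driven variant chosen for illustration, not the `goto return_error` structure the patch uses, and every name in it (`init_step`, `run_init_steps`, `open_ok`, `select_fail`, `never_run`) is hypothetical.

```c
#include <stddef.h>

typedef int (*init_step_fn)(void);

struct init_step {
    init_step_fn run;   /* returns 0 on success */
    const char *error;  /* message reported if run() fails */
};

/* Run steps in order and stop at the first failure. On failure, *error
 * points at that step's message and the step's return code is returned. */
static int run_init_steps(const struct init_step *steps, size_t n,
                          const char **error)
{
    size_t i;

    *error = NULL;
    for (i = 0; i < n; ++i) {
        int ret = steps[i].run();
        if (0 != ret) {
            *error = steps[i].error;
            return ret;
        }
    }
    return 0;
}

/* Hypothetical steps standing in for an open/select pair. */
static int steps_run = 0;
static int open_ok(void)     { ++steps_run; return 0; }
static int select_fail(void) { ++steps_run; return -1; }
static int never_run(void)   { ++steps_run; return 0; }
```

Either form gives the same guarantee the patch relies on: a component is never selected unless opening the framework succeeded first.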
@@ -21,7 +21,7 @@
# This is the US/English help file for Open MPI checkpoint tool
#
[usage]
opal-restart FILENAME
opal-restart -r FILENAME
  Open PAL Single Process Restart Tool

%s
@@ -70,3 +70,10 @@ Error: The restart command failed to properly exec the process per

Expected Component: %s
Selected Component: %s

[cache_not_avail]
Warning: Recommended cache directory could not be accessed. Falling back
to the snapshot location.
Cache Dir   : %s
Snapshot Dir: %s

@@ -61,6 +61,7 @@
#include "opal/util/show_help.h"
#include "opal/util/output.h"
#include "opal/util/opal_environ.h"
#include "opal/util/basename.h"
#include "opal/mca/base/base.h"
#include "opal/mca/base/mca_base_param.h"
@@ -70,14 +71,17 @@
#include "opal/mca/crs/crs.h"
#include "opal/mca/crs/base/base.h"

#include "opal/mca/compress/compress.h"
#include "opal/mca/compress/base/base.h"

/******************
 * Local Functions
 ******************/
static int initialize(int argc, char *argv[]);
static int finalize(void);
static int parse_args(int argc, char *argv[]);
static int check_file(char *given_filename);
static int post_env_vars(int prev_pid, char *location);
static int check_file(void);
static int post_env_vars(int prev_pid, opal_crs_base_snapshot_t *snapshot);

/*****************************************
 * Global Vars for Command line Arguments
@@ -86,10 +90,13 @@ static char *expected_crs_comp = NULL;

typedef struct {
    bool help;
    char *filename;
    bool verbose;
    bool forked;
    char *snapshot_ref;
    char *snapshot_loc;
    char *snapshot_metadata;
    char *snapshot_cache;
    char *snapshot_compress;
    char *snapshot_compress_postfix;
    int output;
} opal_restart_globals_t;
@@ -109,19 +116,40 @@ opal_cmd_line_init_t cmd_line_opts[] = {
      "Be Verbose" },

    { NULL, NULL, NULL,
      '\0', NULL, "fork",
      0,
      &opal_restart_globals.forked, OPAL_CMD_LINE_TYPE_BOOL,
      "Fork off a new process which is the restarted process instead of "
      "replacing opal_restart" },

    { "crs", "base", "snapshot_dir",
      'w', NULL, "where",
      'l', NULL, "location",
      1,
      &opal_restart_globals.snapshot_loc, OPAL_CMD_LINE_TYPE_STRING,
      "Where to find the checkpoint files. In most cases this is automatically "
      "detected, however if a custom location was specified to opal-checkpoint "
      "then this argument is meant to match it."},
      "Full path to the location of the local snapshot."},

    { NULL, NULL, NULL,
      'm', NULL, "metadata",
      1,
      &opal_restart_globals.snapshot_metadata, OPAL_CMD_LINE_TYPE_STRING,
      "Relative path (with respect to --location) to the metadata file."},

    { NULL, NULL, NULL,
      'r', NULL, "reference",
      1,
      &opal_restart_globals.snapshot_ref, OPAL_CMD_LINE_TYPE_STRING,
      "Local snapshot reference."},

    { NULL, NULL, NULL,
      'c', NULL, "cache",
      1,
      &opal_restart_globals.snapshot_cache, OPAL_CMD_LINE_TYPE_STRING,
      "Possible local cache of the snapshot reference."},

    { NULL, NULL, NULL,
      'd', NULL, "decompress",
      1,
      &opal_restart_globals.snapshot_compress, OPAL_CMD_LINE_TYPE_STRING,
      "Decompression component to use."},

    { NULL, NULL, NULL,
      'p', NULL, "decompress_postfix",
      1,
      &opal_restart_globals.snapshot_compress_postfix, OPAL_CMD_LINE_TYPE_STRING,
      "Decompression component postfix."},

    /* End of list */
    { NULL, NULL, NULL,
@@ -151,9 +179,9 @@ main(int argc, char *argv[])
    /*
     * Check for existence of the file, or program in the case of self
     */
    if( OPAL_SUCCESS != (ret = check_file(opal_restart_globals.filename) )) {
    if( OPAL_SUCCESS != (ret = check_file() )) {
        opal_show_help("help-opal-restart.txt", "invalid_filename", true,
                       opal_restart_globals.filename);
                       opal_restart_globals.snapshot_ref);
        exit_status = ret;
        goto cleanup;
    }
@@ -170,19 +198,35 @@ main(int argc, char *argv[])
     * Make sure we are using the correct checkpointer
     */
    if(NULL == expected_crs_comp) {
        char * base = NULL;
        char * full_metadata_path = NULL;
        FILE * metadata = NULL;

        base = opal_crs_base_get_snapshot_directory(opal_restart_globals.filename);
        if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(base,
        asprintf(&full_metadata_path, "%s/%s/%s",
                 opal_restart_globals.snapshot_loc,
                 opal_restart_globals.snapshot_ref,
                 opal_restart_globals.snapshot_metadata);
        if( NULL == (metadata = fopen(full_metadata_path, "r")) ) {
            opal_show_help("help-opal-restart.txt", "invalid_metadata", true,
                           opal_restart_globals.snapshot_metadata,
                           full_metadata_path);
            exit_status = OPAL_ERROR;
            goto cleanup;
        }
        if( OPAL_SUCCESS != (ret = opal_crs_base_extract_expected_component(metadata,
                                                                            &expected_crs_comp,
                                                                            &prev_pid)) ) {
            opal_show_help("help-opal-restart.txt", "invalid_metadata", true,
                           opal_crs_base_metadata_filename, base);
                           opal_restart_globals.snapshot_metadata,
                           full_metadata_path);
            exit_status = ret;
            goto cleanup;
        }

        free(base);
        free(full_metadata_path);
        full_metadata_path = NULL;

        fclose(metadata);
        metadata = NULL;
    }

    opal_output_verbose(10, opal_restart_globals.output,
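The hunk above replaces the derived snapshot directory with a path composed from three command-line values: the location, the snapshot reference, and the metadata file name. The sketch below composes the same `"%s/%s/%s"` layout into a caller-supplied buffer and reports truncation or missing arguments instead of proceeding. The helper name `metadata_path` and the example values in the usage note are hypothetical; the patch itself uses `asprintf` and does not check its result.

```c
#include <stdio.h>

/* Write "<loc>/<ref>/<meta>" into buf. Returns 0 on success, -1 if any
 * argument is NULL or the result would not fit in buflen bytes. */
static int metadata_path(char *buf, size_t buflen,
                         const char *loc, const char *ref, const char *meta)
{
    int n;

    if (NULL == buf || NULL == loc || NULL == ref || NULL == meta) {
        return -1;
    }
    n = snprintf(buf, buflen, "%s/%s/%s", loc, ref, meta);
    if (n < 0 || (size_t)n >= buflen) {
        return -1;
    }
    return 0;
}
```

For example, with the made-up values `"/snapshots"`, `"ref0"`, and `"meta.data"`, the buffer receives `/snapshots/ref0/meta.data`. Rejecting `NULL` inputs up front matters because any of the three options may be absent from the command line.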
@@ -235,21 +279,17 @@ main(int argc, char *argv[])
     * Restart in this process
     ******************************/
    opal_output_verbose(10, opal_restart_globals.output,
                        "Restarting from file (%s)",
                        opal_restart_globals.filename);
    if( opal_restart_globals.forked ) {
        opal_output_verbose(10, opal_restart_globals.output,
                            "\t Forking off a child");
    } else {
        opal_output_verbose(10, opal_restart_globals.output,
                            "\t Exec in self");
    }
                        "Restarting from file (%s)\n",
                        opal_restart_globals.snapshot_ref);

    snapshot = OBJ_NEW(opal_crs_base_snapshot_t);
    snapshot->cold_start = true;
    snapshot->reference_name = strdup(opal_restart_globals.filename);
    snapshot->local_location = opal_crs_base_get_snapshot_directory(snapshot->reference_name);
    snapshot->remote_location = strdup(snapshot->local_location);
    snapshot->cold_start = true;
    asprintf(&(snapshot->snapshot_directory), "%s/%s",
             opal_restart_globals.snapshot_loc,
             opal_restart_globals.snapshot_ref);
    asprintf(&(snapshot->metadata_filename), "%s/%s",
             snapshot->snapshot_directory,
             opal_restart_globals.snapshot_metadata);

    /* Since some checkpoint/restart systems don't pass along env vars to the
     * restarted app, we need to take care of that.
@@ -257,7 +297,7 @@ main(int argc, char *argv[])
     * Included here is the creation of any files or directories that need to be
     * created before the process is restarted.
     */
    if(OPAL_SUCCESS != (ret = post_env_vars(prev_pid, snapshot->local_location) ) ) {
    if(OPAL_SUCCESS != (ret = post_env_vars(prev_pid, snapshot) ) ) {
        exit_status = ret;
        goto cleanup;
    }
@@ -266,27 +306,16 @@ main(int argc, char *argv[])
     * Do the actual restart
     */
    ret = opal_crs.crs_restart(snapshot,
                               opal_restart_globals.forked,
                               false,
                               &child_pid);

    if (OPAL_SUCCESS != ret) {
        opal_show_help("help-opal-restart.txt", "restart_cmd_failure", true,
                       opal_restart_globals.filename, ret);
                       opal_restart_globals.snapshot_ref, ret);
        exit_status = ret;
        goto cleanup;
    }

    /* If we required it to exec in self, then fail if this function returns. */
    if(!opal_restart_globals.forked) {
        opal_show_help("help-opal-restart.txt", "failed-to-exec", true,
                       expected_crs_comp,
                       opal_crs_base_selected_component.base_version.mca_component_name);
        exit_status = ret;
        goto cleanup;
    }

    opal_output_verbose(10, opal_restart_globals.output,
                        "opal_restart: Restarted Child with PID = %d\n", child_pid);
    /* Should never get here, since crs_restart calls exec */

    /***************
     * Cleanup
@@ -320,8 +349,8 @@ static int initialize(int argc, char *argv[])
     * Parse Command line arguments
     */
    if (OPAL_SUCCESS != (ret = parse_args(argc, argv))) {
        goto cleanup;
        exit_status = ret;
        goto cleanup;
    }

    /*
@@ -345,6 +374,18 @@ static int initialize(int argc, char *argv[])
    free(tmp_env_var);
    tmp_env_var = NULL;

    /*
     * Make sure we select the proper compress component.
     */
    if( NULL != opal_restart_globals.snapshot_compress ) {
        tmp_env_var = mca_base_param_env_var("compress");
        opal_setenv(tmp_env_var,
                    opal_restart_globals.snapshot_compress,
                    true, &environ);
        free(tmp_env_var);
        tmp_env_var = NULL;
    }

    /*
     * Initialize the OPAL layer
     */
@@ -353,6 +394,72 @@ static int initialize(int argc, char *argv[])
        goto cleanup;
    }

    /*
     * If the checkpoint was compressed, then decompress it before continuing
     */
    if( NULL != opal_restart_globals.snapshot_compress ) {
        char * zip_dir = NULL;
        char * tmp_str = NULL;

        /* Make sure to clear the selection for the restart,
         * this way the user can switch compression mechanism
         * across restart
         */
        tmp_env_var = mca_base_param_env_var("compress");
        opal_unsetenv(tmp_env_var, &environ);
        free(tmp_env_var);
        tmp_env_var = NULL;

        asprintf(&zip_dir, "%s/%s%s",
                 opal_restart_globals.snapshot_loc,
                 opal_restart_globals.snapshot_ref,
                 opal_restart_globals.snapshot_compress_postfix);

        if (0 > (ret = access(zip_dir, F_OK)) ) {
            opal_output(opal_restart_globals.output,
                        "Error: Unable to access the file [%s]!",
                        zip_dir);
            exit_status = OPAL_ERROR;
            goto cleanup;
        }

        opal_output_verbose(10, opal_restart_globals.output,
                            "Decompressing (%s)",
                            zip_dir);

        opal_compress.decompress(zip_dir, &tmp_str);

        if( NULL != zip_dir ) {
            free(zip_dir);
            zip_dir = NULL;
        }
        if( NULL != tmp_str ) {
            free(tmp_str);
            tmp_str = NULL;
        }
    }

    /*
     * If a cache directory has been suggested, see if it exists
     */
    if( NULL != opal_restart_globals.snapshot_cache ) {
        if(0 == (ret = access(opal_restart_globals.snapshot_cache, F_OK)) ) {
            opal_output_verbose(10, opal_restart_globals.output,
                                "Using the cached snapshot (%s) instead of (%s)",
                                opal_restart_globals.snapshot_cache,
                                opal_restart_globals.snapshot_loc);
            if( NULL != opal_restart_globals.snapshot_loc ) {
                free(opal_restart_globals.snapshot_loc);
                opal_restart_globals.snapshot_loc = NULL;
            }
            opal_restart_globals.snapshot_loc = opal_dirname(opal_restart_globals.snapshot_cache);
        } else {
            opal_show_help("help-opal-restart.txt", "cache_not_avail", true,
                           opal_restart_globals.snapshot_cache,
                           opal_restart_globals.snapshot_loc);
        }
    }

    /*
     * Mark this process as a tool
     */
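The second block in the hunk above prefers a suggested cache directory when it is accessible and otherwise keeps the original snapshot location, printing the `cache_not_avail` warning. The sketch below reduces that decision to one function built on POSIX `access()`. The name `choose_snapshot_dir` is hypothetical, and the sketch is simplified: it returns the cache path itself, whereas the patch stores `opal_dirname()` of the cache path as the new location.

```c
#include <stddef.h>
#include <unistd.h>

/* Return cache_dir when it is set and exists, otherwise fallback_dir.
 * No memory is allocated; the result aliases one of the arguments. */
static const char *choose_snapshot_dir(const char *cache_dir,
                                       const char *fallback_dir)
{
    if (NULL != cache_dir && 0 == access(cache_dir, F_OK)) {
        return cache_dir;
    }
    return fallback_dir;
}
```

`F_OK` only tests for existence. A stricter variant could pass `R_OK` as well if the restart needs to read from the cache, though the patch checks existence only.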
@@ -380,10 +487,13 @@ static int parse_args(int argc, char *argv[])
    char **app_env = NULL, **global_env = NULL;

    opal_restart_globals.help = false;
    opal_restart_globals.filename = NULL;
    opal_restart_globals.verbose = false;
    opal_restart_globals.forked = false;
    opal_restart_globals.snapshot_ref = NULL;
    opal_restart_globals.snapshot_loc = NULL;
    opal_restart_globals.snapshot_metadata = NULL;
    opal_restart_globals.snapshot_cache = NULL;
    opal_restart_globals.snapshot_compress = NULL;
    opal_restart_globals.snapshot_compress_postfix = NULL;
    opal_restart_globals.output = 0;

    /* Parse the command line options */
@@ -412,8 +522,7 @@ static int parse_args(int argc, char *argv[])
     * Now start parsing our specific arguments
     */
    if (OPAL_SUCCESS != ret ||
        opal_restart_globals.help ||
        1 >= argc) {
        opal_restart_globals.help ) {
        char *args = NULL;
        args = opal_cmd_line_get_usage_msg(&cmd_line);
        opal_show_help("help-opal-restart.txt", "usage", true,
@@ -424,20 +533,11 @@ static int parse_args(int argc, char *argv[])

    /* get the remaining bits */
    opal_cmd_line_get_tail(&cmd_line, &argc, &argv);
    if ( 1 > argc ) {
        char *args = NULL;
        args = opal_cmd_line_get_usage_msg(&cmd_line);
        opal_show_help("help-opal-restart.txt", "usage", true,
                       args);
        free(args);
        return OPAL_ERROR;
    }

    opal_restart_globals.filename = strdup(argv[0]);
    if ( NULL == opal_restart_globals.filename ||
         0 >= strlen(opal_restart_globals.filename) ) {
    if ( NULL == opal_restart_globals.snapshot_ref ||
         0 >= strlen(opal_restart_globals.snapshot_ref) ) {
        opal_show_help("help-opal-restart.txt", "invalid_filename", true,
                       opal_restart_globals.filename);
                       opal_restart_globals.snapshot_ref);
        return OPAL_ERROR;
    }
@@ -445,21 +545,20 @@ static int parse_args(int argc, char *argv[])
     * need to be grouped together.
     * Useful in the 'mca crs self' instance.
     */
    if(argc > 1) {
        opal_restart_globals.filename = strdup(opal_argv_join(argv, ' '));
    if(argc > 0) {
        opal_restart_globals.snapshot_ref = strdup(opal_argv_join(argv, ' '));
    }

    return OPAL_SUCCESS;
}

static int check_file(char *given_filename)
static int check_file(void)
{
    int exit_status = OPAL_SUCCESS;
    int ret;
    char * path_to_check = NULL;
    char **argv = NULL;

    if(NULL == given_filename) {
    if(NULL == opal_restart_globals.snapshot_ref) {
        opal_output(opal_restart_globals.output,
                    "Error: No filename provided!");
        exit_status = OPAL_ERROR;
@@ -469,9 +568,10 @@ static int check_file(char *given_filename)
    /*
     * Check for the existence of the snapshot handle in the snapshot directory
     */
    path_to_check = opal_crs_base_get_snapshot_directory(given_filename);
    asprintf(&path_to_check, "%s/%s",
             opal_restart_globals.snapshot_loc,
             opal_restart_globals.snapshot_ref);

    /* Do the check */
    opal_output_verbose(10, opal_restart_globals.output,
                        "Checking for the existence of (%s)",
                        path_to_check);
@ -485,15 +585,15 @@ static int check_file(char *given_filename)
|
||||
}
|
||||
|
||||
cleanup:
|
||||
if( NULL != path_to_check)
|
||||
if( NULL != path_to_check) {
|
||||
free(path_to_check);
|
||||
if( NULL != argv)
|
||||
opal_argv_free(argv);
|
||||
path_to_check = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
static int post_env_vars(int prev_pid, char *location)
|
||||
static int post_env_vars(int prev_pid, opal_crs_base_snapshot_t *snapshot)
|
||||
{
|
||||
int ret, exit_status = OPAL_SUCCESS;
|
||||
char *command = NULL;
|
||||
@ -511,11 +611,10 @@ static int post_env_vars(int prev_pid, char *location)
|
||||
}
|
||||
|
||||
/*
|
||||
* JJH: Hardcode /tmp to match opal/runtime/opal_cr.c in the application.
|
||||
* This is needed so we can pass the previous environment to the restarted
|
||||
* application process.
|
||||
*/
|
||||
asprintf(&proc_file, "/tmp/%s-%d", OPAL_CR_BASE_ENV_NAME, prev_pid);
|
||||
asprintf(&proc_file, "%s/%s-%d", opal_tmp_directory(), OPAL_CR_BASE_ENV_NAME, prev_pid);
|
||||
asprintf(&command, "env | grep OMPI_ > %s", proc_file);
|
||||
|
||||
opal_output_verbose(5, opal_restart_globals.output,
|
||||
@ -530,7 +629,14 @@ static int post_env_vars(int prev_pid, char *location)
|
||||
/*
|
||||
* Any directories that need to be created
|
||||
*/
|
||||
opal_crs_base_metadata_read_token(location, CRS_METADATA_MKDIR, &loc_mkdir);
|
||||
if( NULL == (snapshot->metadata = fopen(snapshot->metadata_filename, "r")) ) {
|
||||
opal_show_help("help-opal-restart.txt", "invalid_metadata", true,
|
||||
opal_restart_globals.snapshot_metadata,
|
||||
snapshot->metadata_filename);
|
||||
exit_status = OPAL_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
opal_crs_base_metadata_read_token(snapshot->metadata, CRS_METADATA_MKDIR, &loc_mkdir);
|
||||
argc = opal_argv_count(loc_mkdir);
|
||||
for( i = 0; i < argc; ++i ) {
|
||||
if( NULL != command ) {
|
||||
@ -555,7 +661,7 @@ static int post_env_vars(int prev_pid, char *location)
|
||||
/*
|
||||
* Any files that need to exist
|
||||
*/
|
||||
opal_crs_base_metadata_read_token(location, CRS_METADATA_TOUCH, &loc_touch);
|
||||
opal_crs_base_metadata_read_token(snapshot->metadata, CRS_METADATA_TOUCH, &loc_touch);
|
||||
argc = opal_argv_count(loc_touch);
|
||||
for( i = 0; i < argc; ++i ) {
|
||||
if( NULL != command ) {
|
||||
@ -595,5 +701,10 @@ static int post_env_vars(int prev_pid, char *location)
|
||||
loc_touch = NULL;
|
||||
}
|
||||
|
||||
if( NULL != snapshot->metadata ) {
|
||||
fclose(snapshot->metadata);
|
||||
snapshot->metadata = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
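Note on the `proc_file` change in `post_env_vars()` above: the hunk replaces the hard-coded `/tmp` prefix with `opal_tmp_directory()`. The same pattern can be sketched without OPAL as below. This is a minimal stand-alone illustration, not the OPAL implementation: `tmp_directory()` and `build_env_file_path()` are hypothetical helpers, the `TMPDIR` fallback is an assumption, and `example_env` in the usage note is a made-up base name rather than the value of `OPAL_CR_BASE_ENV_NAME`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for opal_tmp_directory(): use TMPDIR when it is
 * set and non-empty, otherwise fall back to /tmp. */
const char *tmp_directory(void)
{
    const char *dir = getenv("TMPDIR");
    return (NULL != dir && strlen(dir) > 0) ? dir : "/tmp";
}

/* Build "<dir>/<base>-<pid>" in a freshly allocated string.
 * Returns NULL on failure; the caller owns (and frees) the result. */
char *build_env_file_path(const char *dir, const char *base, int pid)
{
    int len;
    char *path;

    assert(NULL != dir && NULL != base);

    /* First pass only measures the formatted length. */
    len = snprintf(NULL, 0, "%s/%s-%d", dir, base, pid);
    if (len < 0) {
        return NULL;
    }
    path = malloc((size_t)len + 1);
    if (NULL == path) {
        return NULL;
    }
    snprintf(path, (size_t)len + 1, "%s/%s-%d", dir, base, pid);
    return path;
}
```

A caller would combine the two, for example `build_env_file_path(tmp_directory(), "example_env", prev_pid)`, and free the result once the file has been written. Whatever directory is chosen, the process that writes the file and the restarted process that reads it have to resolve the same location.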
|
||||
|
@ -1,6 +1,9 @@
|
||||
# -*- shell-script -*-
|
||||
#
|
||||
# Copyright (c) 2009-2010 Cisco Systems, Inc. All rights reserved.
|
||||
# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
@ -27,6 +30,7 @@ AC_DEFUN([ORTE_CONFIG_FILES],[
|
||||
orte/tools/orte-clean/Makefile
|
||||
orte/tools/orte-top/Makefile
|
||||
orte/tools/orte-bootproxy/Makefile
|
||||
orte/tools/orte-migrate/Makefile
|
||||
orte/tools/orte-info/Makefile
|
||||
])
|
||||
])
|
||||
|
38
orte/mca/errmgr/autor/Makefile.am
Normal file
@ -0,0 +1,38 @@
#
# Copyright (c) 2009-2010 The Trustees of Indiana University.
# All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

dist_pkgdata_DATA = help-orte-errmgr-autor.txt

sources = \
    errmgr_autor.h \
    errmgr_autor_component.c \
    errmgr_autor_module.c

# Make the output library in this directory, and name it either
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
# (for static builds).

if OMPI_BUILD_errmgr_autor_DSO
component_noinst =
component_install = mca_errmgr_autor.la
else
component_noinst = libmca_errmgr_autor.la
component_install =
endif

mcacomponentdir = $(pkglibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_errmgr_autor_la_SOURCES = $(sources)
mca_errmgr_autor_la_LDFLAGS = -module -avoid-version

noinst_LTLIBRARIES = $(component_noinst)
libmca_errmgr_autor_la_SOURCES = $(sources)
libmca_errmgr_autor_la_LDFLAGS = -module -avoid-version
20
orte/mca/errmgr/autor/configure.m4
Normal file
@ -0,0 +1,20 @@
# -*- shell-script -*-
#
# Copyright (c) 2009-2010 The Trustees of Indiana University.
# All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

# MCA_errmgr_autor_CONFIG([action-if-found], [action-if-not-found])
# -----------------------------------------------------------
AC_DEFUN([MCA_errmgr_autor_CONFIG],[
    # If we don't want FT, don't compile this component
    AS_IF([test "$opal_want_ft_cr" = "1"],
        [$1],
        [$2])
])dnl
14
orte/mca/errmgr/autor/configure.params
Normal file
@ -0,0 +1,14 @@
# -*- shell-script -*-
#
# Copyright (c) 2009-2010 The Trustees of Indiana University.
# All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#

PARAM_INIT_FILE=errmgr_autor_component.c
PARAM_CONFIG_FILES="Makefile"
88
orte/mca/errmgr/autor/errmgr_autor.h
Normal file
@ -0,0 +1,88 @@
/*
 * Copyright (c) 2009-2010 The Trustees of Indiana University.
 * All rights reserved.
 *
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/**
 * @file
 *
 * Automatic Recovery Errmgr component
 *
 */

#ifndef MCA_ERRMGR_AUTOR_EXPORT_H
#define MCA_ERRMGR_AUTOR_EXPORT_H

#include "orte_config.h"

#include "opal/mca/mca.h"
#include "opal/event/event.h"

#include "orte/mca/filem/filem.h"
#include "orte/mca/errmgr/errmgr.h"

BEGIN_C_DECLS

/*
 * Local Component structures
 */
struct orte_errmgr_autor_component_t {
    orte_errmgr_base_component_t super; /** Base Errmgr component */
    bool autor_enabled;
    bool timing_enabled;
    int recovery_delay;
    bool skip_oldnode;
};
typedef struct orte_errmgr_autor_component_t orte_errmgr_autor_component_t;
OPAL_MODULE_DECLSPEC extern orte_errmgr_autor_component_t mca_errmgr_autor_component;

int orte_errmgr_autor_component_query(mca_base_module_t **module, int *priority);

/*
 * Module functions: Global
 */
int orte_errmgr_autor_global_module_init(void);
int orte_errmgr_autor_global_module_finalize(void);

int orte_errmgr_autor_global_update_state(orte_jobid_t job,
                                          orte_job_state_t jobstate,
                                          orte_process_name_t *proc_name,
                                          orte_proc_state_t state,
                                          pid_t pid,
                                          orte_exit_code_t exit_code,
                                          orte_errmgr_stack_state_t *stack_state);
int orte_errmgr_autor_global_process_fault(orte_job_t *jdata,
                                           orte_process_name_t *proc_name,
                                           orte_proc_state_t state,
                                           orte_errmgr_stack_state_t *stack_state);
int orte_errmgr_autor_global_suggest_map_targets(orte_proc_t *proc,
                                                 orte_node_t *oldnode,
                                                 opal_list_t *node_list,
                                                 orte_errmgr_stack_state_t *stack_state);
int orte_errmgr_autor_global_ft_event(int state);

/*
 * Module functions: Local (Daemon)
 */
int orte_errmgr_autor_local_module_init(void);
int orte_errmgr_autor_local_module_finalize(void);

int orte_errmgr_autor_local_update_state(orte_jobid_t job,
                                         orte_job_state_t jobstate,
                                         orte_process_name_t *proc_name,
                                         orte_proc_state_t state,
                                         pid_t pid,
                                         orte_exit_code_t exit_code,
                                         orte_errmgr_stack_state_t *stack_state);
int orte_errmgr_autor_local_ft_event(int state);


END_C_DECLS

#endif /* MCA_ERRMGR_AUTOR_EXPORT_H */
161
orte/mca/errmgr/autor/errmgr_autor_component.c
Normal file
@ -0,0 +1,161 @@
|
||||
/*
|
||||
* Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "orte_config.h"
|
||||
#include "opal/util/output.h"
|
||||
|
||||
#include "orte/mca/errmgr/errmgr.h"
|
||||
#include "orte/mca/errmgr/base/base.h"
|
||||
#include "orte/mca/errmgr/base/errmgr_private.h"
|
||||
#include "errmgr_autor.h"
|
||||
|
||||
/*
|
||||
* Public string for version number
|
||||
*/
|
||||
const char *orte_errmgr_autor_component_version_string =
|
||||
"ORTE ERRMGR AutoR MCA component version " ORTE_VERSION;
|
||||
|
||||
/*
|
||||
* Local functionality
|
||||
*/
|
||||
static int errmgr_autor_open(void);
|
||||
static int errmgr_autor_close(void);
|
||||
|
||||
/*
|
||||
* Instantiate the public struct with all of our public information
|
||||
* and pointer to our public functions in it
|
||||
*/
|
||||
orte_errmgr_autor_component_t mca_errmgr_autor_component = {
|
||||
/* First do the base component stuff */
|
||||
{
|
||||
/* Handle the general mca_component_t struct containing
|
||||
* meta information about the component itself
|
||||
*/
|
||||
{
|
||||
ORTE_ERRMGR_BASE_VERSION_3_0_0,
|
||||
/* Component name and version */
|
||||
"autor",
|
||||
ORTE_MAJOR_VERSION,
|
||||
ORTE_MINOR_VERSION,
|
||||
ORTE_RELEASE_VERSION,
|
||||
|
||||
/* Component open and close functions */
|
||||
errmgr_autor_open,
|
||||
errmgr_autor_close,
|
||||
orte_errmgr_autor_component_query
|
||||
},
|
||||
{
|
||||
/* The component is checkpoint ready */
|
||||
MCA_BASE_METADATA_PARAM_CHECKPOINT
|
||||
},
|
||||
|
||||
/* Verbosity level */
|
||||
0,
|
||||
/* opal_output handler */
|
||||
-1,
|
||||
/* Default priority */
|
||||
20
|
||||
}
|
||||
};
|
||||
|
||||
static int errmgr_autor_open(void)
|
||||
{
|
||||
int val;
|
||||
|
||||
/*
|
||||
* This should be the last component to ever get used since
|
||||
* it doesn't do anything.
|
||||
*/
|
||||
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||
"priority",
|
||||
"Priority of the ERRMGR autor component",
|
||||
false, false,
|
||||
mca_errmgr_autor_component.super.priority,
|
||||
&mca_errmgr_autor_component.super.priority);
|
||||
|
||||
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||
"verbose",
|
||||
"Verbose level for the ERRMGR autor component",
|
||||
false, false,
|
||||
mca_errmgr_autor_component.super.verbose,
|
||||
&mca_errmgr_autor_component.super.verbose);
|
||||
/* If there is a custom verbose level for this component, then use it;
* otherwise take our parent's level and output channel
|
||||
*/
|
||||
if ( 0 != mca_errmgr_autor_component.super.verbose) {
|
||||
mca_errmgr_autor_component.super.output_handle = opal_output_open(NULL);
|
||||
opal_output_set_verbosity(mca_errmgr_autor_component.super.output_handle,
|
||||
mca_errmgr_autor_component.super.verbose);
|
||||
} else {
|
||||
mca_errmgr_autor_component.super.output_handle = orte_errmgr_base.output;
|
||||
}
|
||||
|
||||
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||
"timing",
|
||||
"Enable Automatic Recovery timer",
|
||||
false, false,
|
||||
0, &val);
|
||||
mca_errmgr_autor_component.timing_enabled = OPAL_INT_TO_BOOL(val);
|
||||
|
||||
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||
"enable",
|
||||
"Enable Automatic Recovery (Default: 0/off)",
|
||||
false, false,
|
||||
0, &val);
|
||||
mca_errmgr_autor_component.autor_enabled = OPAL_INT_TO_BOOL(val);
|
||||
|
||||
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||
"recovery_delay",
|
||||
"Number of seconds to wait before starting to recover the job after a failure"
|
||||
" [Default: 1 sec]",
|
||||
false, false,
|
||||
1, &val);
|
||||
mca_errmgr_autor_component.recovery_delay = val;
|
||||
|
||||
mca_base_param_reg_int(&mca_errmgr_autor_component.super.base_version,
|
||||
"skip_oldnode",
|
||||
"Skip the old node from failed proc, even if it is still available"
|
||||
" [Default: Enabled]",
|
||||
false, false,
|
||||
1, &val);
|
||||
mca_errmgr_autor_component.skip_oldnode = OPAL_INT_TO_BOOL(val);
|
||||
|
||||
/*
|
||||
* Debug Output
|
||||
*/
|
||||
opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
|
||||
"errmgr:autor: open()");
|
||||
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||
"errmgr:autor: open: priority = %d",
|
||||
mca_errmgr_autor_component.super.priority);
|
||||
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||
"errmgr:autor: open: verbosity = %d",
|
||||
mca_errmgr_autor_component.super.verbose);
|
||||
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||
"errmgr:autor: open: timing = %s",
|
||||
(mca_errmgr_autor_component.timing_enabled ? "Enabled" : "Disabled"));
|
||||
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||
"errmgr:autor: open: Auto. Recover = %s",
|
||||
(mca_errmgr_autor_component.autor_enabled ? "Enabled" : "Disabled"));
|
||||
opal_output_verbose(20, mca_errmgr_autor_component.super.output_handle,
|
||||
"errmgr:autor: open: recover_delay = %d",
|
||||
mca_errmgr_autor_component.recovery_delay);
|
||||
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
static int errmgr_autor_close(void)
|
||||
{
|
||||
opal_output_verbose(10, mca_errmgr_autor_component.super.output_handle,
|
||||
"errmgr:autor: close()");
|
||||
|
||||
return ORTE_SUCCESS;
|
||||
}
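Note on `errmgr_autor_open()` above: each `mca_base_param_reg_int()` call registers an integer parameter with a default, and the boolean options are converted afterwards with `OPAL_INT_TO_BOOL`. The sketch below illustrates that "integer with default, then convert to bool" pattern outside of Open MPI. It is not the MCA parameter system: `parse_int_param()`, `env_int_param()`, and `int_to_bool()` are hypothetical helpers, and reading the value from an environment variable is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Parse a decimal integer. Falls back to default_value when text is NULL,
 * empty, malformed, or out of range for int. */
int parse_int_param(const char *text, int default_value)
{
    char *end = NULL;
    long value;

    if (NULL == text || '\0' == text[0]) {
        return default_value;
    }
    errno = 0;
    value = strtol(text, &end, 10);
    if (0 != errno || '\0' != *end || value < INT_MIN || value > INT_MAX) {
        return default_value;
    }
    return (int)value;
}

/* Hypothetical helper: read an integer parameter from the environment. */
int env_int_param(const char *env_name, int default_value)
{
    assert(NULL != env_name);
    return parse_int_param(getenv(env_name), default_value);
}

/* Same idea as OPAL_INT_TO_BOOL in the hunk: any non-zero value is true. */
bool int_to_bool(int value)
{
    return 0 != value;
}
```

With these helpers, the defaults in the hunk would read as `env_int_param(name, 1)` for `recovery_delay` and `int_to_bool(env_int_param(name, 0))` for `enable`, where `name` is whatever variable name the caller chooses. An unset or malformed value silently yields the default, which may or may not match how the real parameter system reports bad input.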
|
1194
orte/mca/errmgr/autor/errmgr_autor_module.c
Normal file
File diff suppressed because it is too large
28
orte/mca/errmgr/autor/help-orte-errmgr-autor.txt
Normal file
@ -0,0 +1,28 @@
-*- text -*-
#
# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
# University Research and Technology
# Corporation. All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English general help file for ORTE ErrMgr AutoR framework.
#
[recovering_job]
Notice: The processes listed below failed unexpectedly.
Using the last checkpoint to recover the job.
Please standby.
%s
[recovery_complete]
Notice: The job has been successfully recovered from the
last checkpoint.
[failed_to_recover_proc]
Error: The process below has failed. There is no checkpoint available for
this job, so we are terminating the application since automatic
recovery cannot occur.
Internal Name: %s
MCW Rank: %d
@ -1,5 +1,5 @@
|
||||
#
|
||||
# Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
|
||||
# Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
|
||||
# University Research and Technology
|
||||
# Corporation. All rights reserved.
|
||||
# Copyright (c) 2004-2005 The University of Tennessee and The University
|
||||
@ -24,4 +24,5 @@ libmca_errmgr_la_SOURCES += \
|
||||
base/errmgr_base_close.c \
|
||||
base/errmgr_base_select.c \
|
||||
base/errmgr_base_open.c \
|
||||
base/errmgr_base_fns.c
|
||||
base/errmgr_base_fns.c \
|
||||
base/errmgr_base_tool.c
|
||||
|
@ -30,6 +30,7 @@
|
||||
#include "opal/class/opal_list.h"
|
||||
|
||||
#include "opal/mca/mca.h"
|
||||
#include "orte/mca/snapc/base/base.h"
|
||||
#include "orte/mca/errmgr/errmgr.h"
|
||||
|
||||
|
||||
@ -56,6 +57,51 @@ ORTE_DECLSPEC int orte_errmgr_base_close(void);
|
||||
*/
|
||||
ORTE_DECLSPEC extern opal_list_t orte_errmgr_base_components_available;
|
||||
|
||||
/**
|
||||
* Interfaces for orte-migrate tool
|
||||
*/
|
||||
#if OPAL_ENABLE_FT_CR
|
||||
/**
|
||||
* Migrating States
|
||||
*/
|
||||
#define ORTE_ERRMGR_MIGRATE_STATE_ERROR (ORTE_SNAPC_CKPT_MAX + 1)
|
||||
#define ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS (ORTE_SNAPC_CKPT_MAX + 2)
|
||||
#define ORTE_ERRMGR_MIGRATE_STATE_NONE (ORTE_SNAPC_CKPT_MAX + 3)
|
||||
#define ORTE_ERRMGR_MIGRATE_STATE_REQUEST (ORTE_SNAPC_CKPT_MAX + 4)
|
||||
#define ORTE_ERRMGR_MIGRATE_STATE_RUNNING (ORTE_SNAPC_CKPT_MAX + 5)
|
||||
#define ORTE_ERRMGR_MIGRATE_STATE_RUN_CKPT (ORTE_SNAPC_CKPT_MAX + 6)
|
||||
#define ORTE_ERRMGR_MIGRATE_STATE_STARTUP (ORTE_SNAPC_CKPT_MAX + 7)
|
||||
#define ORTE_ERRMGR_MIGRATE_STATE_FINISH (ORTE_SNAPC_CKPT_MAX + 8)
|
||||
#define ORTE_ERRMGR_MIGRATE_MAX (ORTE_SNAPC_CKPT_MAX + 9)
|
||||
|
||||
/*
|
||||
* Commands for command line tool and ErrMgr interaction
|
||||
*/
|
||||
typedef uint8_t orte_errmgr_tool_cmd_flag_t;
|
||||
#define ORTE_ERRMGR_MIGRATE_TOOL_CMD OPAL_UINT8
|
||||
#define ORTE_ERRMGR_MIGRATE_TOOL_INIT_CMD 1
|
||||
#define ORTE_ERRMGR_MIGRATE_TOOL_UPDATE_CMD 2
|
||||
|
||||
/* Initialize/Finalize the orte-migrate communication functionality */
|
||||
ORTE_DECLSPEC int orte_errmgr_base_tool_init(void);
|
||||
ORTE_DECLSPEC int orte_errmgr_base_tool_finalize(void);
|
||||
|
||||
ORTE_DECLSPEC int orte_errmgr_base_migrate_state_str(char ** state_str, int state);
|
||||
|
||||
ORTE_DECLSPEC int orte_errmgr_base_migrate_update(int status);
|
||||
|
||||
/*
|
||||
* Interfaces for C/R related recovery
|
||||
*/
|
||||
ORTE_DECLSPEC int orte_errmgr_base_update_app_context_for_cr_recovery(orte_job_t *jobdata,
|
||||
orte_proc_t *proc,
|
||||
opal_list_t *local_snapshots);
|
||||
|
||||
ORTE_DECLSPEC int orte_errmgr_base_restart_job(orte_jobid_t jobid, char * global_handle, int seq_num);
|
||||
ORTE_DECLSPEC int orte_errmgr_base_migrate_job(orte_jobid_t jobid, orte_snapc_base_request_op_t *datum);
|
||||
|
||||
#endif
|
||||
|
||||
/*
|
||||
* Additional External API function declared in errmgr.h
|
||||
*/
|
||||
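Note on the migration states added above: every `ORTE_ERRMGR_MIGRATE_STATE_*` value is defined as an offset above `ORTE_SNAPC_CKPT_MAX`, so a migration state and a checkpoint state can travel in the same integer without colliding. The sketch below shows one way to exploit that contiguous numbering with a static lookup table. It is an illustration only: `CKPT_MAX` is an assumed placeholder (the real value of `ORTE_SNAPC_CKPT_MAX` is not shown in this diff), the `MIGRATE_*` names and both functions are hypothetical, and the labels are the ones used by `orte_errmgr_base_migrate_state_str()` elsewhere in this commit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CKPT_MAX 10 /* assumed placeholder for ORTE_SNAPC_CKPT_MAX */

/* Same ordering as the #defines in the hunk: offsets +1 .. +9. */
enum {
    MIGRATE_STATE_ERROR = CKPT_MAX + 1,
    MIGRATE_STATE_ERR_INPROGRESS,
    MIGRATE_STATE_NONE,
    MIGRATE_STATE_REQUEST,
    MIGRATE_STATE_RUNNING,
    MIGRATE_STATE_RUN_CKPT,
    MIGRATE_STATE_STARTUP,
    MIGRATE_STATE_FINISH,
    MIGRATE_MAX
};

static const char *const migrate_names[] = {
    "Error", "Error: Migration in progress", " -- ", "Requested",
    "Running", "Checkpointing", "Restarting", "Finished"
};

/* Return a static label, or NULL when state is not a migration state. */
const char *migrate_state_name(int state)
{
    assert(sizeof(migrate_names) / sizeof(migrate_names[0])
           == (size_t)(MIGRATE_MAX - MIGRATE_STATE_ERROR));
    if (state < MIGRATE_STATE_ERROR || state >= MIGRATE_MAX) {
        return NULL;
    }
    return migrate_names[state - MIGRATE_STATE_ERROR];
}

/* Reverse lookup: return the state for a label, or -1 when unknown. */
int migrate_state_from_name(const char *name)
{
    int state;
    assert(NULL != name);
    for (state = MIGRATE_STATE_ERROR; state < MIGRATE_MAX; ++state) {
        if (0 == strcmp(name, migrate_names[state - MIGRATE_STATE_ERROR])) {
            return state;
        }
    }
    return -1;
}
```

Unlike the `strdup()`-based function in the commit, this variant returns pointers to static storage, so the caller must not free them. The trade-off is that unknown states come back as `NULL` instead of a formatted "Unknown" string.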
|
@ -21,27 +21,157 @@
|
||||
#include "orte_config.h"
|
||||
#include "orte/constants.h"
|
||||
|
||||
#ifdef HAVE_STRING_H
|
||||
#include <string.h>
|
||||
#endif
|
||||
#if HAVE_SYS_TYPES_H
|
||||
#include <sys/types.h>
|
||||
#endif /* HAVE_SYS_TYPES_H */
|
||||
#ifdef HAVE_UNISTD_H
|
||||
#include <unistd.h>
|
||||
#endif
|
||||
#endif /* HAVE_UNISTD_H */
|
||||
#if HAVE_SYS_TYPES_H
|
||||
#include <sys/types.h>
|
||||
#endif /* HAVE_SYS_TYPES_H */
|
||||
#if HAVE_SYS_STAT_H
|
||||
#include <sys/stat.h>
|
||||
#endif /* HAVE_SYS_STAT_H */
|
||||
#ifdef HAVE_DIRENT_H
|
||||
#include <dirent.h>
|
||||
#endif /* HAVE_DIRENT_H */
|
||||
#include <time.h>
|
||||
|
||||
#include <stdlib.h>
|
||||
#include <stdarg.h>
|
||||
|
||||
#include "opal/mca/mca.h"
|
||||
#include "opal/mca/base/base.h"
|
||||
#include "opal/mca/base/mca_base_param.h"
|
||||
#include "opal/util/trace.h"
|
||||
#include "opal/util/os_dirpath.h"
|
||||
#include "opal/util/output.h"
|
||||
#include "opal/util/basename.h"
|
||||
#include "opal/util/argv.h"
|
||||
#include "opal/mca/crs/crs.h"
|
||||
#include "opal/mca/crs/base/base.h"
|
||||
#include "opal/util/opal_sos.h"
|
||||
|
||||
#include "orte/util/name_fns.h"
|
||||
#include "orte/util/session_dir.h"
|
||||
|
||||
#include "orte/runtime/orte_globals.h"
|
||||
#include "orte/runtime/runtime.h"
|
||||
#include "orte/runtime/orte_wait.h"
|
||||
#include "orte/runtime/orte_locks.h"
|
||||
|
||||
#include "orte/mca/ess/ess.h"
|
||||
#include "orte/mca/odls/odls.h"
|
||||
#include "orte/mca/plm/plm.h"
|
||||
#include "orte/mca/rml/rml.h"
|
||||
#include "orte/mca/rml/rml_types.h"
|
||||
#include "orte/mca/routed/routed.h"
|
||||
#include "orte/runtime/orte_globals.h"
|
||||
#include "orte/mca/snapc/snapc.h"
|
||||
#include "orte/mca/snapc/base/base.h"
|
||||
#include "orte/mca/sstore/sstore.h"
|
||||
#include "orte/mca/sstore/base/base.h"
|
||||
|
||||
#include "orte/mca/errmgr/errmgr.h"
|
||||
#include "orte/mca/errmgr/base/base.h"
|
||||
#include "orte/mca/errmgr/base/errmgr_private.h"
|
||||
|
||||
/*
|
||||
* Object stuff
|
||||
*/
|
||||
void orte_errmgr_predicted_proc_construct(orte_errmgr_predicted_proc_t *item);
|
||||
void orte_errmgr_predicted_proc_destruct( orte_errmgr_predicted_proc_t *item);
|
||||
|
||||
OBJ_CLASS_INSTANCE(orte_errmgr_predicted_proc_t,
|
||||
opal_list_item_t,
|
||||
orte_errmgr_predicted_proc_construct,
|
||||
orte_errmgr_predicted_proc_destruct);
|
||||
|
||||
void orte_errmgr_predicted_proc_construct(orte_errmgr_predicted_proc_t *item)
|
||||
{
|
||||
item->proc_name.vpid = ORTE_VPID_INVALID;
|
||||
item->proc_name.jobid = ORTE_JOBID_INVALID;
|
||||
}
|
||||
|
||||
void orte_errmgr_predicted_proc_destruct( orte_errmgr_predicted_proc_t *item)
|
||||
{
|
||||
item->proc_name.vpid = ORTE_VPID_INVALID;
|
||||
item->proc_name.jobid = ORTE_JOBID_INVALID;
|
||||
}
|
||||
|
||||
void orte_errmgr_predicted_node_construct(orte_errmgr_predicted_node_t *item);
|
||||
void orte_errmgr_predicted_node_destruct( orte_errmgr_predicted_node_t *item);
|
||||
|
||||
OBJ_CLASS_INSTANCE(orte_errmgr_predicted_node_t,
|
||||
opal_list_item_t,
|
||||
orte_errmgr_predicted_node_construct,
|
||||
orte_errmgr_predicted_node_destruct);
|
||||
|
||||
void orte_errmgr_predicted_node_construct(orte_errmgr_predicted_node_t *item)
|
||||
{
|
||||
item->node_name = NULL;
|
||||
}
|
||||
|
||||
void orte_errmgr_predicted_node_destruct( orte_errmgr_predicted_node_t *item)
|
||||
{
|
||||
if( NULL != item->node_name ) {
|
||||
free(item->node_name);
|
||||
item->node_name = NULL;
|
||||
}
|
||||
}
|
||||
|
||||
void orte_errmgr_predicted_map_construct(orte_errmgr_predicted_map_t *item);
|
||||
void orte_errmgr_predicted_map_destruct( orte_errmgr_predicted_map_t *item);
|
||||
|
||||
OBJ_CLASS_INSTANCE(orte_errmgr_predicted_map_t,
|
||||
opal_list_item_t,
|
||||
orte_errmgr_predicted_map_construct,
|
||||
orte_errmgr_predicted_map_destruct);
|
||||
|
||||
void orte_errmgr_predicted_map_construct(orte_errmgr_predicted_map_t *item)
|
||||
{
|
||||
item->proc_name.vpid = ORTE_VPID_INVALID;
|
||||
item->proc_name.jobid = ORTE_JOBID_INVALID;
|
||||
|
||||
item->node_name = NULL;
|
||||
|
||||
item->map_proc_name.vpid = ORTE_VPID_INVALID;
|
||||
item->map_proc_name.jobid = ORTE_JOBID_INVALID;
|
||||
|
||||
item->map_node_name = NULL;
|
||||
item->off_current_node = false;
|
||||
item->pre_map_fixed_node = NULL;
|
||||
}
|
||||
|
||||
void orte_errmgr_predicted_map_destruct( orte_errmgr_predicted_map_t *item)
|
||||
{
|
||||
item->proc_name.vpid = ORTE_VPID_INVALID;
|
||||
item->proc_name.jobid = ORTE_JOBID_INVALID;
|
||||
|
||||
if( NULL != item->node_name ) {
|
||||
free(item->node_name);
|
||||
item->node_name = NULL;
|
||||
}
|
||||
|
||||
item->map_proc_name.vpid = ORTE_VPID_INVALID;
|
||||
item->map_proc_name.jobid = ORTE_JOBID_INVALID;
|
||||
|
||||
if( NULL != item->map_node_name ) {
|
||||
free(item->map_node_name);
|
||||
item->map_node_name = NULL;
|
||||
}
|
||||
|
||||
item->off_current_node = false;
|
||||
|
||||
if( NULL != item->pre_map_fixed_node ) {
|
||||
free(item->pre_map_fixed_node);
|
||||
item->pre_map_fixed_node = NULL;
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* Public interfaces
|
||||
*/
|
||||
@ -135,9 +265,9 @@ int orte_errmgr_base_abort(int error_code, char *fmt, ...)
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
int orte_errmgr_base_predicted_fault(char ***proc_list,
|
||||
char ***node_list,
|
||||
char ***suggested_nodes)
|
||||
int orte_errmgr_base_predicted_fault(opal_list_t *proc_list,
|
||||
opal_list_t *node_list,
|
||||
opal_list_t *suggested_map)
|
||||
{
|
||||
orte_errmgr_base_module_t *module = NULL;
|
||||
int i, rc;
|
||||
@ -155,7 +285,7 @@ int orte_errmgr_base_predicted_fault(char ***proc_list,
|
||||
continue;
|
||||
}
|
||||
if( NULL != module->predicted_fault ) {
|
||||
rc = module->predicted_fault(proc_list, node_list, suggested_nodes, &stack_state);
|
||||
rc = module->predicted_fault(proc_list, node_list, suggested_map, &stack_state);
|
||||
if (ORTE_SUCCESS != rc || ORTE_ERRMGR_STACK_STATE_COMPLETE & stack_state) {
|
||||
break;
|
||||
}
|
||||
@ -218,3 +348,348 @@ int orte_errmgr_base_ft_event(int state)
|
||||
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
/********************
|
||||
* Utility functions
|
||||
********************/
|
||||
#if OPAL_ENABLE_FT_CR
|
||||
int orte_errmgr_base_migrate_state_str(char ** state_str, int state)
|
||||
{
|
||||
switch(state) {
|
||||
case ORTE_ERRMGR_MIGRATE_STATE_NONE:
|
||||
*state_str = strdup(" -- ");
|
||||
break;
|
||||
case ORTE_ERRMGR_MIGRATE_STATE_REQUEST:
|
||||
*state_str = strdup("Requested");
|
||||
break;
|
||||
case ORTE_ERRMGR_MIGRATE_STATE_RUNNING:
|
||||
*state_str = strdup("Running");
|
||||
break;
|
||||
case ORTE_ERRMGR_MIGRATE_STATE_RUN_CKPT:
|
||||
*state_str = strdup("Checkpointing");
|
||||
break;
|
||||
case ORTE_ERRMGR_MIGRATE_STATE_STARTUP:
|
||||
*state_str = strdup("Restarting");
|
||||
break;
|
||||
case ORTE_ERRMGR_MIGRATE_STATE_FINISH:
|
||||
*state_str = strdup("Finished");
|
||||
break;
|
||||
case ORTE_ERRMGR_MIGRATE_STATE_ERROR:
|
||||
*state_str = strdup("Error");
|
||||
break;
|
||||
case ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS:
|
||||
*state_str = strdup("Error: Migration in progress");
|
||||
break;
|
||||
default:
|
||||
asprintf(state_str, "Unknown %d", state);
|
||||
break;
|
||||
}
|
||||
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
#endif
|
||||
|
||||
#if OPAL_ENABLE_FT_CR
|
||||
int orte_errmgr_base_update_app_context_for_cr_recovery(orte_job_t *jobdata,
|
||||
orte_proc_t *proc,
|
||||
opal_list_t *local_snapshots)
|
||||
{
|
||||
int ret, exit_status = ORTE_SUCCESS;
|
||||
opal_list_item_t *item = NULL;
|
||||
orte_std_cntr_t i_app;
|
||||
int argc = 0;
|
||||
orte_app_context_t *cur_app_context = NULL;
|
||||
orte_app_context_t *new_app_context = NULL;
|
||||
orte_sstore_base_local_snapshot_info_t *vpid_snapshot = NULL;
|
||||
char *reference_fmt_str = NULL;
|
||||
char *location_str = NULL;
|
||||
char *cache_location_str = NULL;
|
||||
char *ref_location_fmt_str = NULL;
|
||||
char *tmp_str = NULL;
|
||||
char *global_snapshot_ref = NULL;
|
||||
char *global_snapshot_seq = NULL;
|
||||
|
||||
/*
|
||||
* Get the snapshot restart command for this process
|
||||
* JJH CLEANUP: Pass in the vpid_snapshot, so we don't have to look it up every time?
|
||||
*/
|
||||
for(item = opal_list_get_first(local_snapshots);
|
||||
item != opal_list_get_end(local_snapshots);
|
||||
item = opal_list_get_next(item) ) {
|
||||
vpid_snapshot = (orte_sstore_base_local_snapshot_info_t*)item;
|
||||
if(OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL,
|
||||
&vpid_snapshot->process_name,
|
||||
&proc->name) ) {
|
||||
break;
|
||||
}
|
||||
else {
|
||||
vpid_snapshot = NULL;
|
||||
}
|
||||
}
|
||||
|
||||
if( NULL == vpid_snapshot ) {
|
||||
ORTE_ERROR_LOG(ORTE_ERROR);
|
||||
exit_status = ORTE_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||
SSTORE_METADATA_LOCAL_SNAP_REF_FMT,
|
||||
&reference_fmt_str);
|
||||
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||
SSTORE_METADATA_LOCAL_SNAP_LOC,
|
||||
&location_str);
|
||||
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||
SSTORE_METADATA_LOCAL_SNAP_REF_LOC_FMT,
|
||||
&ref_location_fmt_str);
|
||||
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||
SSTORE_METADATA_GLOBAL_SNAP_REF,
|
||||
&global_snapshot_ref);
|
||||
orte_sstore.get_attr(vpid_snapshot->ss_handle,
|
||||
SSTORE_METADATA_GLOBAL_SNAP_SEQ,
|
||||
&global_snapshot_seq);
|
||||
|
||||
/*
|
||||
* Find current app_context
|
||||
*/
|
||||
cur_app_context = NULL;
|
||||
for(i_app = 0; i_app < opal_pointer_array_get_size(jobdata->apps); ++i_app) {
|
||||
cur_app_context = (orte_app_context_t *)opal_pointer_array_get_item(jobdata->apps,
|
||||
i_app);
|
||||
if( NULL == cur_app_context ) {
|
||||
continue;
|
||||
}
|
||||
if(proc->app_idx == cur_app_context->idx) {
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
if( NULL == cur_app_context ) {
|
||||
ORTE_ERROR_LOG(ORTE_ERROR);
exit_status = ORTE_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* if > 1 processes in this app context
|
||||
* Create a new app_context
|
||||
* Copy over attributes
|
||||
* Add it to the job_t data structure
|
||||
* Associate it with this process in the job
|
||||
* else
|
||||
* Reuse this app_context
|
||||
*/
|
||||
if( cur_app_context->num_procs > 1 ) {
|
||||
/* Create a new app_context */
|
||||
new_app_context = OBJ_NEW(orte_app_context_t);
|
||||
|
||||
/* Copy over attributes */
|
||||
new_app_context->idx = cur_app_context->idx;
|
||||
new_app_context->app = NULL; /* strdup(cur_app_context->app); */
|
||||
new_app_context->num_procs = 1;
|
||||
new_app_context->argv = NULL; /* opal_argv_copy(cur_app_context->argv); */
|
||||
new_app_context->env = opal_argv_copy(cur_app_context->env);
|
||||
new_app_context->cwd = (NULL == cur_app_context->cwd ? NULL :
|
||||
strdup(cur_app_context->cwd));
|
||||
new_app_context->user_specified_cwd = cur_app_context->user_specified_cwd;
|
||||
new_app_context->hostfile = (NULL == cur_app_context->hostfile ? NULL :
|
||||
strdup(cur_app_context->hostfile));
|
||||
new_app_context->add_hostfile = (NULL == cur_app_context->add_hostfile ? NULL :
|
||||
strdup(cur_app_context->add_hostfile));
|
||||
new_app_context->dash_host = opal_argv_copy(cur_app_context->dash_host);
|
||||
new_app_context->prefix_dir = (NULL == cur_app_context->prefix_dir ? NULL :
|
||||
strdup(cur_app_context->prefix_dir));
|
||||
new_app_context->preload_binary = false;
|
||||
new_app_context->preload_libs = false;
|
||||
new_app_context->preload_files_dest_dir = NULL;
|
||||
new_app_context->preload_files_src_dir = NULL;
|
||||
|
||||
asprintf(&tmp_str, reference_fmt_str, vpid_snapshot->process_name.vpid);
|
||||
asprintf(&(new_app_context->sstore_load),
|
||||
"%s:%s:%s:%s:%s:%s",
|
||||
location_str,
|
||||
global_snapshot_ref,
|
||||
tmp_str,
|
||||
(vpid_snapshot->compress_comp == NULL ? "" : vpid_snapshot->compress_comp),
|
||||
(vpid_snapshot->compress_postfix == NULL ? "" : vpid_snapshot->compress_postfix),
|
||||
global_snapshot_seq);
|
||||
|
||||
new_app_context->used_on_node = cur_app_context->used_on_node;
|
||||
|
||||
/* Add it to the job_t data structure */
|
||||
/*current_global_jobdata->num_apps++; */
|
||||
new_app_context->idx = (jobdata->num_apps);
|
||||
proc->app_idx = new_app_context->idx;
|
||||
|
||||
opal_pointer_array_add(jobdata->apps, new_app_context);
|
||||
++(jobdata->num_apps);
|
||||
|
||||
/* Remove association with the old app_context */
|
||||
--(cur_app_context->num_procs);
|
||||
}
|
||||
else {
|
||||
new_app_context = cur_app_context;
|
||||
|
||||
/* Cleanout old stuff */
|
||||
free(new_app_context->app);
|
||||
new_app_context->app = NULL;
|
||||
|
||||
opal_argv_free(new_app_context->argv);
|
||||
new_app_context->argv = NULL;
|
||||
|
||||
asprintf(&tmp_str, reference_fmt_str, vpid_snapshot->process_name.vpid);
|
||||
asprintf(&(new_app_context->sstore_load),
|
||||
"%s:%s:%s:%s:%s:%s",
|
||||
location_str,
|
||||
global_snapshot_ref,
|
||||
tmp_str,
|
||||
(vpid_snapshot->compress_comp == NULL ? "" : vpid_snapshot->compress_comp),
|
||||
(vpid_snapshot->compress_postfix == NULL ? "" : vpid_snapshot->compress_postfix),
|
||||
global_snapshot_seq);
|
||||
}
|
||||
|
||||
/*
|
||||
* Update the app_context with the restart information
|
||||
*/
|
||||
new_app_context->app = strdup("opal-restart");
|
||||
opal_argv_append(&argc, &(new_app_context->argv), new_app_context->app);
|
||||
opal_argv_append(&argc, &(new_app_context->argv), "-l");
|
||||
opal_argv_append(&argc, &(new_app_context->argv), location_str);
|
||||
opal_argv_append(&argc, &(new_app_context->argv), "-m");
|
||||
opal_argv_append(&argc, &(new_app_context->argv), orte_sstore_base_local_metadata_filename);
|
||||
opal_argv_append(&argc, &(new_app_context->argv), "-r");
|
||||
if( NULL != tmp_str ) {
|
||||
free(tmp_str);
|
||||
tmp_str = NULL;
|
||||
}
|
||||
asprintf(&tmp_str, reference_fmt_str, vpid_snapshot->process_name.vpid);
|
||||
opal_argv_append(&argc, &(new_app_context->argv), tmp_str);
|
||||
|
||||
cleanup:
|
||||
if( NULL != tmp_str) {
|
||||
free(tmp_str);
|
||||
tmp_str = NULL;
|
||||
}
|
||||
if( NULL != location_str ) {
|
||||
free(location_str);
|
||||
location_str = NULL;
|
||||
}
|
||||
if( NULL != cache_location_str ) {
|
||||
free(cache_location_str);
|
||||
cache_location_str = NULL;
|
||||
}
|
||||
if( NULL != reference_fmt_str ) {
|
||||
free(reference_fmt_str);
|
||||
reference_fmt_str = NULL;
|
||||
}
|
||||
if( NULL != ref_location_fmt_str ) {
|
||||
free(ref_location_fmt_str);
|
||||
ref_location_fmt_str = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
#endif
|
||||
|
||||
#if OPAL_ENABLE_FT_CR
|
||||
int orte_errmgr_base_restart_job(orte_jobid_t jobid, char * global_handle, int seq_num)
|
||||
{
|
||||
int ret, exit_status = ORTE_SUCCESS;
|
||||
orte_process_name_t loc_proc;
|
||||
orte_sstore_base_handle_t prev_sstore_handle = ORTE_SSTORE_HANDLE_INVALID;
|
||||
|
||||
/* JJH First determine if we can recover this way */
|
||||
|
||||
/*
|
||||
* Find the corresponding sstore handle
|
||||
*/
|
||||
prev_sstore_handle = orte_sstore_handle_last_stable;
|
||||
if( ORTE_SUCCESS != (ret = orte_sstore.request_restart_handle(&orte_sstore_handle_last_stable,
|
||||
NULL,
|
||||
global_handle,
|
||||
seq_num,
|
||||
NULL)) ) {
|
||||
ORTE_ERROR_LOG(ret);
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* Start the recovery
|
||||
*/
|
||||
orte_snapc_base_has_recovered = false;
|
||||
loc_proc.jobid = jobid;
|
||||
loc_proc.vpid = 0;
|
||||
orte_errmgr_base_update_state(jobid, ORTE_JOB_STATE_RESTART,
|
||||
&loc_proc, ORTE_PROC_STATE_KILLED_BY_CMD,
|
||||
0, 0);
|
||||
while( !orte_snapc_base_has_recovered ) {
|
||||
opal_progress();
|
||||
}
|
||||
orte_sstore_handle_last_stable = prev_sstore_handle;
|
||||
|
||||
cleanup:
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
int orte_errmgr_base_migrate_job(orte_jobid_t jobid, orte_snapc_base_request_op_t *datum)
|
||||
{
|
||||
int ret, exit_status = ORTE_SUCCESS;
|
||||
int i;
|
||||
opal_list_t *proc_list = NULL;
|
||||
opal_list_t *node_list = NULL;
|
||||
opal_list_t *suggested_map_list = NULL;
|
||||
orte_errmgr_predicted_map_t *onto_map = NULL;
|
||||
#if 0
|
||||
orte_errmgr_predicted_proc_t *off_proc = NULL;
|
||||
orte_errmgr_predicted_node_t *off_node = NULL;
|
||||
#endif
|
||||
|
||||
proc_list = OBJ_NEW(opal_list_t);
|
||||
node_list = OBJ_NEW(opal_list_t);
|
||||
suggested_map_list = OBJ_NEW(opal_list_t);
|
||||
|
||||
for( i = 0; i < datum->mig_num; ++i ) {
|
||||
/*
|
||||
* List all processes that are included in the migration.
|
||||
* We will sort them out in the component.
|
||||
*/
|
||||
onto_map = OBJ_NEW(orte_errmgr_predicted_map_t);
|
||||
|
||||
if( (datum->mig_off_node)[i] ) {
|
||||
onto_map->off_current_node = true;
|
||||
} else {
|
||||
onto_map->off_current_node = false;
|
||||
}
|
||||
|
||||
/* Who to migrate */
|
||||
onto_map->proc_name.jobid = jobid;
|
||||
onto_map->proc_name.vpid = (datum->mig_vpids)[i];
|
||||
|
||||
/* Destination */
|
||||
onto_map->map_proc_name.jobid = jobid;
|
||||
onto_map->map_proc_name.vpid = (datum->mig_vpid_pref)[i];
|
||||
|
||||
if( ((datum->mig_host_pref)[i])[0] == '\0') {
|
||||
onto_map->map_node_name = NULL;
|
||||
} else {
|
||||
onto_map->map_node_name = strdup((datum->mig_host_pref)[i]);
|
||||
}
|
||||
|
||||
opal_list_append(suggested_map_list, &(onto_map->super));
|
||||
}
|
||||
|
||||
if( ORTE_SUCCESS != (ret = orte_errmgr_base_predicted_fault(proc_list, node_list, suggested_map_list)) ) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
#endif
|
||||
|
||||
/********************
|
||||
* Local Functions
|
||||
********************/
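Note on the `sstore_load` string built above: it is a colon-separated record of six fields (location, global snapshot reference, local snapshot reference, compression component, compression postfix, sequence number), with empty strings standing in for absent compression fields. The following is a minimal standalone sketch of that layout, not the ORTE code itself. The helper name `build_sstore_load` and the sample values in the usage note are hypothetical, and it uses standard `snprintf` rather than `asprintf`.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of ORTE): join six fields in the same
 * order as the "%s:%s:%s:%s:%s:%s" format used for sstore_load.
 * NULL compression fields become empty strings, as in the diff.
 * Returns a malloc'ed string the caller must free, or NULL on failure.
 */
char *build_sstore_load(const char *location, const char *global_ref,
                        const char *local_ref, const char *compress_comp,
                        const char *compress_postfix, const char *seq)
{
    const char *comp    = (NULL == compress_comp)    ? "" : compress_comp;
    const char *postfix = (NULL == compress_postfix) ? "" : compress_postfix;
    char *out = NULL;
    int len;

    /* First pass: measure the formatted length without writing anything */
    len = snprintf(NULL, 0, "%s:%s:%s:%s:%s:%s",
                   location, global_ref, local_ref, comp, postfix, seq);
    if (len < 0) {
        return NULL;
    }

    out = malloc((size_t)len + 1);
    if (NULL == out) {
        return NULL;
    }

    /* Second pass: format into the exactly sized buffer */
    snprintf(out, (size_t)len + 1, "%s:%s:%s:%s:%s:%s",
             location, global_ref, local_ref, comp, postfix, seq);
    return out;
}
```

With no compression configured, a call such as `build_sstore_load("/tmp/ckpt", "global_ref", "local_ref_0", NULL, NULL, "0")` keeps the two empty fields in place, so a consumer splitting on `:` still sees six fields.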
|
||||
|
477
orte/mca/errmgr/base/errmgr_base_tool.c
Normal file
@ -0,0 +1,477 @@
|
||||
/*
|
||||
* Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "orte_config.h"
|
||||
|
||||
#ifdef HAVE_STRING_H
|
||||
#include <string.h>
|
||||
#endif
|
||||
#if HAVE_SYS_TYPES_H
|
||||
#include <sys/types.h>
|
||||
#endif /* HAVE_SYS_TYPES_H */
|
||||
#ifdef HAVE_UNISTD_H
|
||||
#include <unistd.h>
|
||||
#endif /* HAVE_UNISTD_H */
|
||||
#if HAVE_SYS_TYPES_H
|
||||
#include <sys/types.h>
|
||||
#endif /* HAVE_SYS_TYPES_H */
|
||||
#if HAVE_SYS_STAT_H
|
||||
#include <sys/stat.h>
|
||||
#endif /* HAVE_SYS_STAT_H */
|
||||
#ifdef HAVE_DIRENT_H
|
||||
#include <dirent.h>
|
||||
#endif /* HAVE_DIRENT_H */
|
||||
#include <time.h>
|
||||
|
||||
#include "opal/mca/mca.h"
|
||||
#include "opal/mca/base/base.h"
|
||||
|
||||
#include "opal/mca/base/mca_base_param.h"
|
||||
#include "opal/util/os_dirpath.h"
|
||||
#include "opal/util/output.h"
|
||||
#include "opal/util/basename.h"
|
||||
#include "opal/util/argv.h"
|
||||
#include "opal/mca/crs/crs.h"
|
||||
#include "opal/mca/crs/base/base.h"
|
||||
|
||||
#include "orte/mca/rml/rml.h"
|
||||
#include "orte/mca/rml/rml_types.h"
|
||||
#include "orte/mca/snapc/snapc.h"
|
||||
#include "orte/runtime/orte_globals.h"
|
||||
#include "orte/util/name_fns.h"
|
||||
|
||||
#include "orte/mca/errmgr/errmgr.h"
|
||||
#include "orte/mca/errmgr/base/base.h"
|
||||
#include "orte/mca/errmgr/base/errmgr_private.h"
|
||||
|
||||
/**
|
||||
* This file contains functions for the HNP to communicate with the
|
||||
* orte-migrate command.
|
||||
*/
|
||||
#if OPAL_ENABLE_FT_CR
|
||||
|
||||
/******************
|
||||
* Local Functions
|
||||
******************/
|
||||
static int errmgr_base_tool_start_cmdline_listener(void);
|
||||
static int errmgr_base_tool_stop_cmdline_listener(void);
|
||||
|
||||
static void errmgr_base_tool_cmdline_recv(int status,
|
||||
orte_process_name_t* sender,
|
||||
opal_buffer_t* buffer,
|
||||
orte_rml_tag_t tag,
|
||||
void* cbdata);
|
||||
static void errmgr_base_tool_cmdline_process_recv(int fd,
|
||||
short event,
|
||||
void *cbdata);
|
||||
|
||||
|
||||
/******************
|
||||
* Object stuff
|
||||
******************/
|
||||
static orte_process_name_t errmgr_cmdline_sender = {ORTE_JOBID_INVALID, ORTE_VPID_INVALID};
|
||||
static bool errmgr_cmdline_recv_issued = false;
|
||||
static int errmgr_tool_initialized = 0; /* init/finalize reference count */
|
||||
|
||||
/********************
|
||||
* Module Functions
|
||||
********************/
|
||||
int orte_errmgr_base_tool_init(void)
|
||||
{
|
||||
int ret;
|
||||
|
||||
if( (++errmgr_tool_initialized) != 1 ) {
|
||||
if( errmgr_tool_initialized < 1 ) {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
/* Only HNP communicates with tools */
|
||||
if (! ORTE_PROC_IS_HNP) {
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
/*
|
||||
* Setup command line migrate tool request listener
|
||||
*/
|
||||
if( ORTE_SUCCESS != (ret = errmgr_base_tool_start_cmdline_listener()) ) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
return ret;
|
||||
}
|
||||
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
int orte_errmgr_base_tool_finalize(void)
|
||||
{
|
||||
int ret;
|
||||
|
||||
if( (--errmgr_tool_initialized) != 0 ) {
|
||||
if( errmgr_tool_initialized < 0 ) {
|
||||
return OPAL_ERROR;
|
||||
}
|
||||
return OPAL_SUCCESS;
|
||||
}
|
||||
|
||||
/* Only HNP communicates with tools */
|
||||
if (! ORTE_PROC_IS_HNP) {
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
/*
|
||||
* Clean up listeners
|
||||
*/
|
||||
if( ORTE_SUCCESS != (ret = errmgr_base_tool_stop_cmdline_listener()) ) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
return ret;
|
||||
}
|
||||
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
int orte_errmgr_base_migrate_update(int status)
|
||||
{
|
||||
int ret, exit_status = ORTE_SUCCESS;
|
||||
opal_buffer_t *loc_buffer = NULL;
|
||||
orte_errmgr_tool_cmd_flag_t command = ORTE_ERRMGR_MIGRATE_TOOL_UPDATE_CMD;
|
||||
|
||||
/* Only HNP communicates with tools */
|
||||
if (! ORTE_PROC_IS_HNP) {
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
/*
|
||||
* If this is an invalid state, then return an error
|
||||
*/
|
||||
if( ORTE_ERRMGR_MIGRATE_MAX < status ) {
|
||||
opal_output(orte_errmgr_base.output,
"errmgr:base:tool:update() Error: Invalid state %d > (Max %d)",
|
||||
status, ORTE_ERRMGR_MIGRATE_MAX);
|
||||
return ORTE_ERR_BAD_PARAM;
|
||||
}
|
||||
|
||||
/*
|
||||
* If the caller is indicating that they are finished and ready for another
|
||||
* command, then repost the RML listener.
|
||||
*/
|
||||
if( ORTE_ERRMGR_MIGRATE_STATE_NONE == status ) {
|
||||
if( ORTE_SUCCESS != (ret = errmgr_base_tool_start_cmdline_listener()) ) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
return ret;
|
||||
}
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
/*
|
||||
* Noop if invalid peer, or peer not specified
|
||||
*/
|
||||
if( OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_NAME_INVALID, &errmgr_cmdline_sender) ) {
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
/*
|
||||
* Do not send to self, as that is silly.
|
||||
*/
|
||||
if( OPAL_EQUAL == orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_PROC_MY_HNP, &errmgr_cmdline_sender) ) {
|
||||
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||
"errmgr:base:tool:update() Warning: Do not send to self!\n"));
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||
"errmgr:base:tool:update() Sending update command <status %d>\n",
|
||||
status));
|
||||
|
||||
/********************
|
||||
* Send over the status of the migration
|
||||
* - migration state
|
||||
********************/
|
||||
if (NULL == (loc_buffer = OBJ_NEW(opal_buffer_t))) {
|
||||
exit_status = ORTE_ERROR;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (ORTE_SUCCESS != (ret = opal_dss.pack(loc_buffer, &command, 1, ORTE_ERRMGR_MIGRATE_TOOL_CMD)) ) {
|
||||
opal_output(orte_errmgr_base.output,
|
||||
"errmgr:base:tool:update() Error: DSS Pack (cmd) Failure (ret = %d)\n",
|
||||
ret);
|
||||
ORTE_ERROR_LOG(ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (ORTE_SUCCESS != (ret = opal_dss.pack(loc_buffer, &status, 1, OPAL_INT))) {
|
||||
opal_output(orte_errmgr_base.output,
|
||||
"errmgr:base:tool:update() Error: DSS Pack (status) Failure (ret = %d)\n",
|
||||
ret);
|
||||
ORTE_ERROR_LOG(ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (0 > (ret = orte_rml.send_buffer(&errmgr_cmdline_sender, loc_buffer, ORTE_RML_TAG_MIGRATE, 0))) {
|
||||
opal_output(orte_errmgr_base.output,
|
||||
"errmgr:base:tool:update() Error: Send (status) Failure (ret = %d)\n",
|
||||
ret);
|
||||
ORTE_ERROR_LOG(ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
if(NULL != loc_buffer) {
|
||||
OBJ_RELEASE(loc_buffer);
|
||||
loc_buffer = NULL;
|
||||
}
|
||||
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
/********************
|
||||
* Utility functions
|
||||
********************/
|
||||
|
||||
/********************
|
||||
* Local Functions
|
||||
********************/
|
||||
static int errmgr_base_tool_start_cmdline_listener(void)
|
||||
{
|
||||
int ret, exit_status = ORTE_SUCCESS;
|
||||
|
||||
if (errmgr_cmdline_recv_issued && ORTE_PROC_IS_HNP) {
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output,
|
||||
"errmgr:base:tool: Startup Command Line Channel"));
|
||||
|
||||
/*
|
||||
* Coordinator command listener
|
||||
*/
|
||||
errmgr_cmdline_sender.jobid = ORTE_JOBID_INVALID;
|
||||
errmgr_cmdline_sender.vpid = ORTE_VPID_INVALID;
|
||||
if (ORTE_SUCCESS != (ret = orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
|
||||
ORTE_RML_TAG_MIGRATE,
|
||||
0,
|
||||
errmgr_base_tool_cmdline_recv,
|
||||
NULL))) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
errmgr_cmdline_recv_issued = true;
|
||||
|
||||
cleanup:
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
|
||||
static int errmgr_base_tool_stop_cmdline_listener(void)
|
||||
{
|
||||
int ret, exit_status = ORTE_SUCCESS;
|
||||
|
||||
if (!errmgr_cmdline_recv_issued && ORTE_PROC_IS_HNP) {
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
OPAL_OUTPUT_VERBOSE((5, orte_errmgr_base.output,
|
||||
"errmgr:base:tool: Shutdown Command Line Channel"));
|
||||
|
||||
if (ORTE_SUCCESS != (ret = orte_rml.recv_cancel(ORTE_NAME_WILDCARD,
|
||||
ORTE_RML_TAG_MIGRATE))) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
exit_status = ret;
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
errmgr_cmdline_recv_issued = false;
|
||||
|
||||
cleanup:
|
||||
return exit_status;
|
||||
}
|
||||
|
||||
/*****************
|
||||
* Listener Callbacks
|
||||
*****************/
|
||||
static void errmgr_base_tool_cmdline_recv(int status,
|
||||
orte_process_name_t* sender,
|
||||
opal_buffer_t* buffer,
|
||||
orte_rml_tag_t tag,
|
||||
void* cbdata)
|
||||
{
|
||||
if( ORTE_RML_TAG_MIGRATE != tag ) {
|
||||
opal_output(orte_errmgr_base.output,
|
||||
"errmgr:base:tool:recv() Error: Unknown tag: Received a command message from %s (tag = %d).",
|
||||
ORTE_NAME_PRINT(sender), tag);
|
||||
ORTE_ERROR_LOG(ORTE_ERR_BAD_PARAM);
|
||||
return;
|
||||
}
|
||||
|
||||
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||
"errmgr:base:tool:recv() Command Line: Start a migration operation [Sender = %s]",
|
||||
ORTE_NAME_PRINT(sender)));
|
||||
|
||||
errmgr_cmdline_recv_issued = false; /* Not a persistent RML message */
|
||||
|
||||
/*
|
||||
* Do not process this right away - we need to get out of the recv before
|
||||
* we process the message to avoid performing the rest of the job while
|
||||
* inside this receive! Instead, setup an event so that the message gets processed
|
||||
* as soon as we leave the recv.
|
||||
*
|
||||
* The macro makes a copy of the buffer, which is released in the event handler - the incoming
|
||||
* buffer, however, is NOT released here, although its payload IS transferred
|
||||
* to the message buffer for later processing
|
||||
*
|
||||
*/
|
||||
ORTE_MESSAGE_EVENT(sender, buffer, tag, errmgr_base_tool_cmdline_process_recv);
|
||||
|
||||
return;
|
||||
}
|
||||
|
||||
static void errmgr_base_tool_cmdline_process_recv(int fd, short event, void *cbdata)
|
||||
{
|
||||
int ret;
|
||||
orte_message_event_t *mev = (orte_message_event_t*)cbdata;
|
||||
orte_process_name_t *sender = NULL, swap_dest;
|
||||
orte_errmgr_tool_cmd_flag_t command;
|
||||
orte_std_cntr_t count = 1;
|
||||
char *off_nodes = NULL;
|
||||
char *off_procs = NULL;
|
||||
char *onto_nodes = NULL;
|
||||
char **split_off_nodes = NULL;
|
||||
char **split_off_procs = NULL;
|
||||
char **split_onto_nodes = NULL;
|
||||
opal_list_t *proc_list = NULL;
|
||||
opal_list_t *node_list = NULL;
|
||||
opal_list_t *suggested_map_list = NULL;
|
||||
orte_errmgr_predicted_proc_t *off_proc = NULL;
|
||||
orte_errmgr_predicted_node_t *off_node = NULL;
|
||||
orte_errmgr_predicted_map_t *onto_map = NULL;
|
||||
int cnt = 0, i;
|
||||
|
||||
sender = &(mev->sender);
|
||||
|
||||
/*
|
||||
* If we are already interacting with a command line tool then reject this
|
||||
* request, since we only allow the processing of one tool command at a
|
||||
* time.
|
||||
*/
|
||||
if( OPAL_EQUAL != orte_util_compare_name_fields(ORTE_NS_CMP_ALL, ORTE_NAME_INVALID, &errmgr_cmdline_sender) ) {
|
||||
swap_dest.jobid = errmgr_cmdline_sender.jobid;
|
||||
swap_dest.vpid = errmgr_cmdline_sender.vpid;
|
||||
|
||||
errmgr_cmdline_sender = *sender;
|
||||
orte_errmgr_base_migrate_update(ORTE_ERRMGR_MIGRATE_STATE_ERR_INPROGRESS);
|
||||
|
||||
errmgr_cmdline_sender.jobid = swap_dest.jobid;
|
||||
errmgr_cmdline_sender.vpid = swap_dest.vpid;
|
||||
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
errmgr_cmdline_sender = *sender;
|
||||
|
||||
count = 1;
|
||||
if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &command, &count, ORTE_ERRMGR_MIGRATE_TOOL_CMD))) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* orte-migrate has requested that a migration be started
|
||||
*/
|
||||
if (ORTE_ERRMGR_MIGRATE_TOOL_INIT_CMD == command) {
|
||||
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||
"errmgr:base:tool:recv() Command line requested process migration [command %d]\n",
|
||||
command));
|
||||
|
||||
/*
|
||||
* Unpack the buffer from the orte-migrate command
|
||||
*/
|
||||
count = 1;
|
||||
if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &(off_procs), &count, OPAL_STRING))) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &(off_nodes), &count, OPAL_STRING))) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
if (ORTE_SUCCESS != (ret = opal_dss.unpack(mev->buffer, &(onto_nodes), &count, OPAL_STRING))) {
|
||||
ORTE_ERROR_LOG(ret);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
/*
|
||||
* Parse the comma separated list
|
||||
*/
|
||||
proc_list = OBJ_NEW(opal_list_t);
|
||||
node_list = OBJ_NEW(opal_list_t);
|
||||
suggested_map_list = OBJ_NEW(opal_list_t);
|
||||
|
||||
split_off_procs = opal_argv_split(off_procs, ',');
|
||||
cnt = opal_argv_count(split_off_procs);
|
||||
if( cnt > 0 ) {
|
||||
for(i = 0; i < cnt; ++i) {
|
||||
off_proc = OBJ_NEW(orte_errmgr_predicted_proc_t);
|
||||
off_proc->proc_name.vpid = atoi(split_off_procs[i]);
|
||||
opal_list_append(proc_list, &(off_proc->super));
|
||||
}
|
||||
}
|
||||
|
||||
split_off_nodes = opal_argv_split(off_nodes, ',');
|
||||
cnt = opal_argv_count(split_off_nodes);
|
||||
if( cnt > 0 ) {
|
||||
for(i = 0; i < cnt; ++i) {
|
||||
off_node = OBJ_NEW(orte_errmgr_predicted_node_t);
|
||||
off_node->node_name = strdup(split_off_nodes[i]);
|
||||
opal_list_append(node_list, &(off_node->super));
|
||||
}
|
||||
}
|
||||
|
||||
split_onto_nodes = opal_argv_split(onto_nodes, ',');
|
||||
cnt = opal_argv_count(split_onto_nodes);
|
||||
if( cnt > 0 ) {
|
||||
for(i = 0; i < cnt; ++i) {
|
||||
onto_map = OBJ_NEW(orte_errmgr_predicted_map_t);
|
||||
onto_map->map_node_name = strdup(split_onto_nodes[i]);
|
||||
opal_list_append(suggested_map_list, &(onto_map->super));
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* Pass to the predicted fault function to see how they would like to progress
|
||||
*/
|
||||
orte_errmgr_base_predicted_fault(proc_list, node_list, suggested_map_list);
|
||||
}
|
||||
/*
|
||||
* Unknown command
|
||||
*/
|
||||
else {
|
||||
OPAL_OUTPUT_VERBOSE((10, orte_errmgr_base.output,
|
||||
"errmgr:base:tool:recv() Command line sent an unknown command (command %d)\n",
|
||||
command));
|
||||
ORTE_ERROR_LOG(ORTE_ERR_NOT_SUPPORTED);
|
||||
goto cleanup;
|
||||
}
|
||||
|
||||
cleanup:
|
||||
/* release the message event */
|
||||
OBJ_RELEASE(mev);
|
||||
|
||||
return;
|
||||
}
|
||||
#endif
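Note on the list parsing in `errmgr_base_tool_cmdline_process_recv()`: the off-process list arrives as a comma-separated string and each token is converted with `atoi()`, which silently yields 0 for a malformed token. The sketch below shows a stricter variant of the same parsing step using only the C standard library. It is an illustration under assumptions, not the ORTE implementation: `parse_vpid_list` is a hypothetical name, and `strtoul` is swapped in for the `opal_argv_split` plus `atoi` pair.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of ORTE): parse a comma-separated list
 * of non-negative integers such as "0,3,7" into out[].
 * Returns the number of values parsed, 0 for a NULL or empty string,
 * or -1 if a token is malformed or more than max_out values are given.
 */
int parse_vpid_list(const char *csv, unsigned long *out, int max_out)
{
    const char *p = csv;
    int n = 0;

    if (NULL == csv || '\0' == *csv) {
        return 0;
    }

    while (n < max_out) {
        char *end = NULL;
        unsigned long value;

        /* Reject empty tokens, signs, and leading whitespace */
        if (!isdigit((unsigned char)*p)) {
            return -1;
        }

        value = strtoul(p, &end, 10);
        if (end == p) {
            return -1;
        }
        out[n++] = value;

        if ('\0' == *end) {
            return n;          /* consumed the whole string */
        }
        if (',' != *end) {
            return -1;         /* unexpected character after the number */
        }
        p = end + 1;           /* skip the comma and parse the next token */
    }

    return -1;                 /* more values than out[] can hold */
}
```

For example, `parse_vpid_list("0,3,7", v, 4)` fills `v` with 0, 3, 7 and returns 3, while `"1,,2"` is rejected instead of being read as vpid 0 for the empty token.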
|
@ -72,9 +72,10 @@ ORTE_DECLSPEC int orte_errmgr_base_abort(int error_code, char *fmt, ...)
|
||||
__opal_attribute_format__(__printf__, 2, 3)
|
||||
# endif
|
||||
;
|
||||
ORTE_DECLSPEC int orte_errmgr_base_predicted_fault(char ***proc_list,
|
||||
char ***node_list,
|
||||
char ***suggested_nodes);
|
||||
|
||||
ORTE_DECLSPEC int orte_errmgr_base_predicted_fault(opal_list_t *proc_list,
|
||||
opal_list_t *node_list,
|
||||
opal_list_t *suggested_map);
|
||||
ORTE_DECLSPEC int orte_errmgr_base_suggest_map_targets(orte_proc_t *proc,
|
||||
orte_node_t *oldnode,
|
||||
opal_list_t *node_list);
|
||||
|
38
orte/mca/errmgr/crmig/Makefile.am
Normal file
@ -0,0 +1,38 @@
|
||||
#
|
||||
# Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||
# All rights reserved.
|
||||
#
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
#
|
||||
# $HEADER$
|
||||
#
|
||||
|
||||
dist_pkgdata_DATA = help-orte-errmgr-crmig.txt
|
||||
|
||||
sources = \
|
||||
errmgr_crmig.h \
|
||||
errmgr_crmig_component.c \
|
||||
errmgr_crmig_module.c
|
||||
|
||||
# Make the output library in this directory, and name it either
|
||||
# mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
|
||||
# (for static builds).
|
||||
|
||||
if OMPI_BUILD_errmgr_crmig_DSO
|
||||
component_noinst =
|
||||
component_install = mca_errmgr_crmig.la
|
||||
else
|
||||
component_noinst = libmca_errmgr_crmig.la
|
||||
component_install =
|
||||
endif
|
||||
|
||||
mcacomponentdir = $(pkglibdir)
|
||||
mcacomponent_LTLIBRARIES = $(component_install)
|
||||
mca_errmgr_crmig_la_SOURCES = $(sources)
|
||||
mca_errmgr_crmig_la_LDFLAGS = -module -avoid-version
|
||||
|
||||
noinst_LTLIBRARIES = $(component_noinst)
|
||||
libmca_errmgr_crmig_la_SOURCES = $(sources)
|
||||
libmca_errmgr_crmig_la_LDFLAGS = -module -avoid-version
|
20
orte/mca/errmgr/crmig/configure.m4
Normal file
@ -0,0 +1,20 @@
|
||||
# -*- shell-script -*-
|
||||
#
|
||||
# Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||
# All rights reserved.
|
||||
#
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
#
|
||||
# $HEADER$
|
||||
#
|
||||
|
||||
# MCA_errmgr_crmig_CONFIG([action-if-found], [action-if-not-found])
|
||||
# -----------------------------------------------------------
|
||||
AC_DEFUN([MCA_errmgr_crmig_CONFIG],[
|
||||
# If we don't want FT, don't compile this component
|
||||
AS_IF([test "$opal_want_ft_cr" = "1"],
|
||||
[$1],
|
||||
[$2])
|
||||
])dnl
|
14
orte/mca/errmgr/crmig/configure.params
Normal file
@ -0,0 +1,14 @@
|
||||
# -*- shell-script -*-
|
||||
#
|
||||
# Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||
# All rights reserved.
|
||||
#
|
||||
# $COPYRIGHT$
|
||||
#
|
||||
# Additional copyrights may follow
|
||||
#
|
||||
# $HEADER$
|
||||
#
|
||||
|
||||
PARAM_INIT_FILE=errmgr_crmig_component.c
|
||||
PARAM_CONFIG_FILES="Makefile"
|
93
orte/mca/errmgr/crmig/errmgr_crmig.h
Normal file
@ -0,0 +1,93 @@
|
||||
/*
|
||||
* Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
/**
|
||||
* @file
|
||||
*
|
||||
* Checkpoint/Restart Process Migration (CRMIG) ErrMgr component
|
||||
*
|
||||
* Simple, braindead implementation.
|
||||
*/
|
||||
|
||||
#ifndef MCA_ERRMGR_CRMIG_EXPORT_H
|
||||
#define MCA_ERRMGR_CRMIG_EXPORT_H
|
||||
|
||||
#include "orte_config.h"
|
||||
|
||||
#include "opal/mca/mca.h"
|
||||
#include "opal/event/event.h"
|
||||
|
||||
#include "orte/mca/filem/filem.h"
|
||||
#include "orte/mca/errmgr/errmgr.h"
|
||||
|
||||
BEGIN_C_DECLS
|
||||
|
||||
/*
|
||||
* Local Component structures
|
||||
*/
|
||||
struct orte_errmgr_crmig_component_t {
|
||||
orte_errmgr_base_component_t super; /** Base Errmgr component */
|
||||
bool crmig_enabled;
|
||||
bool timing_enabled;
|
||||
};
|
||||
typedef struct orte_errmgr_crmig_component_t orte_errmgr_crmig_component_t;
|
||||
OPAL_MODULE_DECLSPEC extern orte_errmgr_crmig_component_t mca_errmgr_crmig_component;
|
||||
|
||||
int orte_errmgr_crmig_component_query(mca_base_module_t **module, int *priority);
|
||||
|
||||
/*
|
||||
* Module functions: Global
|
||||
*/
|
||||
int orte_errmgr_crmig_global_module_init(void);
|
||||
int orte_errmgr_crmig_global_module_finalize(void);
|
||||
|
||||
int orte_errmgr_crmig_global_update_state(orte_jobid_t job,
|
||||
orte_job_state_t jobstate,
|
||||
orte_process_name_t *proc_name,
|
||||
orte_proc_state_t state,
|
||||
pid_t pid,
|
||||
orte_exit_code_t exit_code,
|
||||
orte_errmgr_stack_state_t *stack_state);
|
||||
|
||||
int orte_errmgr_crmig_global_predicted_fault(opal_list_t *proc_list,
|
||||
opal_list_t *node_list,
|
||||
opal_list_t *suggested_map,
|
||||
orte_errmgr_stack_state_t *stack_state);
|
||||
int orte_errmgr_crmig_global_process_fault(orte_job_t *jdata,
|
||||
orte_process_name_t *proc_name,
|
||||
orte_proc_state_t state,
|
||||
orte_errmgr_stack_state_t *stack_state);
|
||||
int orte_errmgr_crmig_global_suggest_map_targets(orte_proc_t *proc,
|
||||
orte_node_t *oldnode,
|
||||
opal_list_t *node_list,
|
||||
orte_errmgr_stack_state_t *stack_state);
|
||||
|
||||
int orte_errmgr_crmig_global_ft_event(int state);
|
||||
|
||||
/*
|
||||
* Module functions: Local
|
||||
*/
|
||||
int orte_errmgr_crmig_local_module_init(void);
|
||||
int orte_errmgr_crmig_local_module_finalize(void);
|
||||
|
||||
int orte_errmgr_crmig_local_update_state(orte_jobid_t job,
|
||||
orte_job_state_t jobstate,
|
||||
orte_process_name_t *proc_name,
|
||||
orte_proc_state_t state,
|
||||
pid_t pid,
|
||||
orte_exit_code_t exit_code,
|
||||
orte_errmgr_stack_state_t *stack_state);
|
||||
int orte_errmgr_crmig_local_ft_event(int state);
|
||||
|
||||
|
||||
END_C_DECLS
|
||||
|
||||
#endif /* MCA_ERRMGR_CRMIG_EXPORT_H */
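Note on `orte_errmgr_crmig_component_t` above: the component struct places the base component as its first member (`super`) and appends its own fields. That layout lets framework code handle a pointer to the base type while the component recovers its full type with a cast. The following is a minimal sketch of the pattern with hypothetical stand-in types (`base_component_t`, `crmig_like_component_t`); the real MCA structures carry far more fields.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the framework's base component type */
typedef struct base_component {
    const char *name;
    int priority;
} base_component_t;

/* Hypothetical component type: base first, component fields after */
typedef struct crmig_like_component {
    base_component_t super;   /* must remain the first member */
    bool enabled;
    bool timing;
} crmig_like_component_t;

/* Framework-side code only knows about the base type */
int base_priority(const base_component_t *component)
{
    return component->priority;
}

/*
 * Component-side code recovers its own type. The cast is valid because
 * a pointer to a struct and a pointer to its first member convert to
 * each other in C.
 */
bool component_enabled(const base_component_t *component)
{
    return ((const crmig_like_component_t *)component)->enabled;
}
```

A component instance initialized as `{ { "crmig", 40 }, true, false }` can then be passed as `&comp.super` to either function.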
|
142
orte/mca/errmgr/crmig/errmgr_crmig_component.c
Normal file
@ -0,0 +1,142 @@
|
||||
/*
|
||||
* Copyright (c) 2009-2010 The Trustees of Indiana University.
|
||||
* All rights reserved.
|
||||
*
|
||||
* $COPYRIGHT$
|
||||
*
|
||||
* Additional copyrights may follow
|
||||
*
|
||||
* $HEADER$
|
||||
*/
|
||||
|
||||
#include "orte_config.h"
|
||||
#include "opal/util/output.h"
|
||||
|
||||
#include "orte/mca/errmgr/errmgr.h"
|
||||
#include "orte/mca/errmgr/base/base.h"
|
||||
#include "orte/mca/errmgr/base/errmgr_private.h"
|
||||
#include "errmgr_crmig.h"
|
||||
|
||||
/*
|
||||
* Public string for version number
|
||||
*/
|
||||
const char *orte_errmgr_crmig_component_version_string =
|
||||
"ORTE ERRMGR crmig MCA component version " ORTE_VERSION;
|
||||
|
||||
/*
|
||||
* Local functionality
|
||||
*/
|
||||
static int errmgr_crmig_open(void);
|
||||
static int errmgr_crmig_close(void);
|
||||
|
||||
/*
|
||||
* Instantiate the public struct with all of our public information
|
||||
* and pointer to our public functions in it
|
||||
*/
|
||||
orte_errmgr_crmig_component_t mca_errmgr_crmig_component = {
|
||||
/* First do the base component stuff */
|
||||
{
|
||||
/* Handle the general mca_component_t struct containing
|
||||
* meta information about the component itself
|
||||
*/
|
||||
{
|
||||
ORTE_ERRMGR_BASE_VERSION_3_0_0,
|
||||
/* Component name and version */
|
||||
"crmig",
|
||||
ORTE_MAJOR_VERSION,
|
||||
ORTE_MINOR_VERSION,
|
||||
ORTE_RELEASE_VERSION,
|
||||
|
||||
/* Component open and close functions */
|
||||
errmgr_crmig_open,
|
||||
errmgr_crmig_close,
|
||||
orte_errmgr_crmig_component_query
|
||||
},
|
||||
{
|
||||
/* The component is checkpoint ready */
|
||||
MCA_BASE_METADATA_PARAM_CHECKPOINT
|
||||
},
|
||||
|
||||
/* Verbosity level */
|
||||
0,
|
||||
/* opal_output handler */
|
||||
-1,
|
||||
/* Default priority */
|
||||
40
|
||||
}
|
||||
};
|
||||
|
||||
static int errmgr_crmig_open(void)
|
||||
{
|
||||
int val;
|
||||
|
||||
/*
|
||||
* This should be the last component to ever get used since
|
||||
* it doesn't do anything.
|
||||
*/
|
||||
mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version,
|
||||
"priority",
|
||||
"Priority of the ERRMGR crmig component",
|
||||
false, false,
|
||||
mca_errmgr_crmig_component.super.priority,
|
||||
&mca_errmgr_crmig_component.super.priority);
|
||||
|
||||
mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version,
|
||||
"verbose",
|
||||
"Verbose level for the ERRMGR crmig component",
|
||||
false, false,
|
||||
mca_errmgr_crmig_component.super.verbose,
|
||||
&mca_errmgr_crmig_component.super.verbose);
|
||||
/* If there is a custom verbose level for this component then use it,
* otherwise take our parent's level and output channel
|
||||
*/
|
||||
if ( 0 != mca_errmgr_crmig_component.super.verbose) {
|
||||
mca_errmgr_crmig_component.super.output_handle = opal_output_open(NULL);
|
||||
opal_output_set_verbosity(mca_errmgr_crmig_component.super.output_handle,
|
||||
mca_errmgr_crmig_component.super.verbose);
|
||||
} else {
|
||||
mca_errmgr_crmig_component.super.output_handle = orte_errmgr_base.output;
|
||||
}
|
||||
|
||||
mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version,
|
||||
"timing",
|
||||
"Enable Process Migration timer",
|
||||
false, false,
|
||||
0, &val);
|
||||
mca_errmgr_crmig_component.timing_enabled = OPAL_INT_TO_BOOL(val);
|
||||
|
||||
mca_base_param_reg_int(&mca_errmgr_crmig_component.super.base_version,
|
||||
"enable",
|
||||
"Enable Process Migration (Default: 0/off)",
|
||||
false, false,
|
||||
0, &val);
|
||||
mca_errmgr_crmig_component.crmig_enabled = OPAL_INT_TO_BOOL(val);
|
||||
|
||||
/*
|
||||
* Debug Output
|
||||
*/
|
||||
opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle,
|
||||
"errmgr:crmig: open()");
|
||||
opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle,
|
||||
"errmgr:crmig: open: priority = %d",
|
||||
mca_errmgr_crmig_component.super.priority);
|
||||
opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle,
|
||||
"errmgr:crmig: open: verbosity = %d",
|
||||
mca_errmgr_crmig_component.super.verbose);
|
||||
opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle,
|
||||
"errmgr:crmig: open: Proc. Mig. = %s",
|
||||
(mca_errmgr_crmig_component.crmig_enabled ? "Enabled" : "Disabled"));
|
||||
opal_output_verbose(20, mca_errmgr_crmig_component.super.output_handle,
|
||||
"errmgr:crmig: open: timing = %s",
|
||||
(mca_errmgr_crmig_component.timing_enabled ? "Enabled" : "Disabled"));
|
||||
|
||||
return ORTE_SUCCESS;
|
||||
}
|
||||
|
||||
static int errmgr_crmig_close(void)
|
||||
{
|
||||
opal_output_verbose(10, mca_errmgr_crmig_component.super.output_handle,
|
||||
"errmgr:crmig: close()");
|
||||
|
||||
return ORTE_SUCCESS;
|
||||
}
|
1678
orte/mca/errmgr/crmig/errmgr_crmig_module.c
Normal file
File diff suppressed because it is too large
27
orte/mca/errmgr/crmig/help-orte-errmgr-crmig.txt
Normal file
@ -0,0 +1,27 @@
-*- text -*-
#
# Copyright (c) 2009-2010 The Trustees of Indiana University and Indiana
#                         University Research and Technology
#                         Corporation. All rights reserved.
#
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
# This is the US/English general help file for ORTE ErrMgr CRMig framework.
#
[migrating_job]
Notice: A migration of this job has been requested.
The processes below will be migrated.
Please standby.
%s
[migrated_job]
Notice: The processes have been successfully migrated to/from the specified
machines.
[no_migrating_procs]
Warning: Could not find any processes to migrate on the nodes specified.
You provided the following:
Nodes: %s
Procs: %s
@@ -79,6 +79,70 @@ BEGIN_C_DECLS
/* type definition */
typedef uint8_t orte_errmgr_stack_state_t;

/*
 * Structure to describe a predicted process fault.
 *
 * This can be expanded in the future to support assurance levels, and
 * additional information that may wish to be conveyed.
 */
struct orte_errmgr_predicted_proc_t {
    /** This is an object, so must have a super */
    opal_list_item_t super;

    /** Process Name */
    orte_process_name_t proc_name;
};
typedef struct orte_errmgr_predicted_proc_t orte_errmgr_predicted_proc_t;
OBJ_CLASS_DECLARATION(orte_errmgr_predicted_proc_t);

/*
 * Structure to describe a predicted node fault.
 *
 * This can be expanded in the future to support assurance levels, and
 * additional information that may wish to be conveyed.
 */
struct orte_errmgr_predicted_node_t {
    /** This is an object, so must have a super */
    opal_list_item_t super;

    /** Node Name */
    char * node_name;
};
typedef struct orte_errmgr_predicted_node_t orte_errmgr_predicted_node_t;
OBJ_CLASS_DECLARATION(orte_errmgr_predicted_node_t);

/*
 * Structure to describe a suggested remapping element for a predicted fault.
 *
 * This can be expanded in the future to support weights , and
 * additional information that may wish to be conveyed.
 */
struct orte_errmgr_predicted_map_t {
    /** This is an object, so must have a super */
    opal_list_item_t super;

    /** Process Name (predicted to fail) */
    orte_process_name_t proc_name;

    /** Node Name (predicted to fail) */
    char * node_name;

    /** Process Name (Map to) */
    orte_process_name_t map_proc_name;

    /** Node Name (Map to) */
    char * map_node_name;

    /** Just off current node */
    bool off_current_node;

    /** Pre-map fixed node assignment */
    char * pre_map_fixed_node;
};
typedef struct orte_errmgr_predicted_map_t orte_errmgr_predicted_map_t;
OBJ_CLASS_DECLARATION(orte_errmgr_predicted_map_t);


/*
 * Macro definitions
 */
@@ -129,14 +193,15 @@ typedef int (*orte_errmgr_base_API_update_state_fn_t)(orte_jobid_t job,
 *
 * @param[in] proc_list List of processes (or NULL if none)
 * @param[in] node_list List of nodes (or NULL if none)
 * @param[in] suggested_nodes List of suggested nodes to use on recovery (or NULL if none)
 * @param[in] suggested_map List of mapping suggestions to use on recovery (or NULL if none)
 *
 * @retval ORTE_SUCCESS The operation completed successfully
 * @retval ORTE_ERROR An unspecifed error occurred
 */
typedef int (*orte_errmgr_base_API_predicted_fault_fn_t)(char ***proc_list,
                                                         char ***node_list,
                                                         char ***suggested_nodes);
typedef int (*orte_errmgr_base_API_predicted_fault_fn_t)(opal_list_t *proc_list,
                                                         opal_list_t *node_list,
                                                         opal_list_t *suggested_map);

/**
 * Suggest a node to map a restarting process onto
 *
@@ -212,9 +277,9 @@ typedef int (*orte_errmgr_base_module_update_state_fn_t)(orte_jobid_t job,
                                                         pid_t pid,
                                                         orte_exit_code_t exit_code,
                                                         orte_errmgr_stack_state_t *stack_state);
typedef int (*orte_errmgr_base_module_predicted_fault_fn_t)(char ***proc_list,
                                                            char ***node_list,
                                                            char ***suggested_nodes,
typedef int (*orte_errmgr_base_module_predicted_fault_fn_t)(opal_list_t *proc_list,
                                                            opal_list_t *node_list,
                                                            opal_list_t *suggested_map,
                                                            orte_errmgr_stack_state_t *stack_state);
typedef int (*orte_errmgr_base_module_suggest_map_targets_fn_t)(orte_proc_t *proc,
                                                                orte_node_t *oldnode,
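The list-based predicted_fault() arguments in the hunks above could be built and walked roughly as sketched below. This is a minimal, self-contained illustration and not code from this commit: demo_name_t, demo_map_t, demo_map_push, demo_count_fixed_targets and demo_map_free are hypothetical names, a plain next pointer stands in for opal_list_item_t / opal_list_t so the sketch compiles without the Open MPI headers, and the reading of off_current_node as "no explicit target node given" is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for orte_process_name_t. */
typedef struct {
    unsigned jobid;
    unsigned vpid;
} demo_name_t;

/* Hypothetical stand-in for orte_errmgr_predicted_map_t (subset of fields).
 * A plain 'next' pointer replaces the opal_list_item_t 'super' member. */
typedef struct demo_map {
    struct demo_map *next;
    demo_name_t proc_name;      /* process predicted to fail */
    char *node_name;            /* node predicted to fail */
    char *map_node_name;        /* suggested target node, or NULL */
    bool off_current_node;      /* assumption: true when no target is given */
} demo_map_t;

/* Portable string copy (strdup is POSIX, not ISO C). */
static char *demo_strdup(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (NULL != copy) {
        memcpy(copy, s, n);
    }
    return copy;
}

/* Prepend one mapping suggestion; returns the new list head. */
demo_map_t *demo_map_push(demo_map_t *head, demo_name_t proc,
                          const char *from_node, const char *to_node)
{
    demo_map_t *m;

    assert(NULL != from_node);
    m = calloc(1, sizeof(*m));
    if (NULL == m) {
        return head;            /* allocation failed: list is unchanged */
    }
    m->proc_name = proc;
    m->node_name = demo_strdup(from_node);
    m->map_node_name = (NULL != to_node) ? demo_strdup(to_node) : NULL;
    m->off_current_node = (NULL == to_node);
    m->next = head;
    return m;
}

/* Consumer shaped like the new interface: the list may be NULL ("none").
 * Returns how many suggestions name an explicit target node. */
int demo_count_fixed_targets(const demo_map_t *suggested_map)
{
    int count = 0;
    const demo_map_t *m;

    for (m = suggested_map; NULL != m; m = m->next) {
        if (NULL != m->map_node_name) {
            ++count;
        }
    }
    return count;
}

/* Release every entry and the strings it owns. */
void demo_map_free(demo_map_t *head)
{
    while (NULL != head) {
        demo_map_t *next = head->next;
        free(head->node_name);
        free(head->map_node_name);
        free(head);
        head = next;
    }
}
```

A caller would push one entry per suggestion, hand the head (or NULL) to the consumer, and free the list afterwards. In the real code base the entries would presumably be created with OBJ_NEW and appended to an opal_list_t instead.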
Some files were not shown because too many files have changed in this diff