/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2015      Los Alamos National Security, LLC. All rights
 *                         reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
/**
 * @file
 *
 * Checkpoint/Restart Coordination Protocol (CRCP) Interface
 *
 */

#ifndef MCA_CRCP_H
#define MCA_CRCP_H

#include "ompi_config.h"

#include "opal/class/opal_object.h"
#include "opal/mca/mca.h"
#include "opal/mca/base/base.h"
#include "opal/mca/crs/crs.h"
#include "opal/mca/crs/base/base.h"
#include "opal/mca/btl/btl.h"
#include "opal/mca/btl/base/base.h"
#include "opal/class/opal_free_list.h"

#include "ompi/datatype/ompi_datatype.h"
#include "ompi/request/request.h"
#include "ompi/mca/pml/pml.h"
#include "ompi/mca/pml/base/base.h"

BEGIN_C_DECLS

/**
 * Module initialization function.
 * Returns OMPI_SUCCESS
 */
typedef int (*ompi_crcp_base_module_init_fn_t)
     (void);

/**
 * Module finalization function.
 * Returns OMPI_SUCCESS
 */
typedef int (*ompi_crcp_base_module_finalize_fn_t)
     (void);

/************************
 * MPI Quiesce Interface
 ************************/
/**
 * MPI_Quiesce_start component interface
 */
typedef int (*ompi_crcp_base_quiesce_start_fn_t)
     (MPI_Info *info);

/**
 * MPI_Quiesce_end component interface
 */
typedef int (*ompi_crcp_base_quiesce_end_fn_t)
     (MPI_Info *info);

/************************
 * PML Wrapper hooks
 * PML Wrapper is the CRCPW PML component
 ************************/
/**
 * To allow us to work before and after a PML command
 */
enum ompi_crcp_base_pml_states_t {
    OMPI_CRCP_PML_PRE,
    OMPI_CRCP_PML_POST,
    OMPI_CRCP_PML_SKIP,
    OMPI_CRCP_PML_DONE
};
typedef enum ompi_crcp_base_pml_states_t ompi_crcp_base_pml_states_t;

struct ompi_crcp_base_pml_state_t {
    opal_free_list_item_t super;
    ompi_crcp_base_pml_states_t state;
    int error_code;
    mca_pml_base_component_t *wrapped_pml_component;
    mca_pml_base_module_t *wrapped_pml_module;
};
typedef struct ompi_crcp_base_pml_state_t ompi_crcp_base_pml_state_t;
OMPI_DECLSPEC OBJ_CLASS_DECLARATION(ompi_crcp_base_pml_state_t);
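
/*
The PRE/POST states above suggest a bracketing pattern: a wrapper calls the
coordination hook once before the wrapped PML operation and once after it,
passing the same state object through both calls. The block below is a
minimal, self-contained sketch of that pattern, not the CRCPW implementation.
Every demo_ name is hypothetical, 0 stands in for OMPI_SUCCESS, and the
meaning given to SKIP and DONE is an assumption that this header does not
spell out. The real state type also carries an opal_free_list_item_t and the
wrapped PML component and module pointers, which are left out here.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-ins for the state types declared above.
enum demo_states { DEMO_PRE, DEMO_POST, DEMO_SKIP, DEMO_DONE };

struct demo_state {
    enum demo_states state;
    int error_code;            // 0 plays the role of OMPI_SUCCESS
};

// Like the hook typedefs in this header, a hook takes the state object as
// its last argument and hands a state object back.
typedef struct demo_state *(*demo_hook_fn_t)(int tag, struct demo_state *);

// Hypothetical wrapped operation, standing in for e.g. a PML send.
typedef int (*demo_op_fn_t)(int tag);

// Bracket one wrapped operation with a PRE call and a POST call.
// Assumption: SKIP means "do not run the wrapped operation, but still run
// the POST hook"; DONE means "the hook already handled everything".
static int demo_wrap(demo_hook_fn_t hook, demo_op_fn_t op, int tag)
{
    struct demo_state st = { DEMO_PRE, 0 };
    struct demo_state *s;
    int ret;

    assert(hook != NULL && op != NULL);

    s = hook(tag, &st);                 // PRE phase
    if (0 != s->error_code) {
        return s->error_code;
    }
    if (DEMO_DONE == s->state) {
        return 0;
    }
    if (DEMO_SKIP != s->state) {
        ret = op(tag);                  // the wrapped operation itself
        if (0 != ret) {
            return ret;
        }
    }

    s->state = DEMO_POST;               // POST phase
    s = hook(tag, s);
    return s->error_code;
}

// Counters and helpers used only to exercise demo_wrap().
static int demo_pre_calls = 0;
static int demo_post_calls = 0;
static int demo_op_calls = 0;

// Hook that only records which phase it was called in.
static struct demo_state *demo_counting_hook(int tag, struct demo_state *s)
{
    (void)tag;
    if (DEMO_PRE == s->state) {
        demo_pre_calls++;
    } else if (DEMO_POST == s->state) {
        demo_post_calls++;
    }
    return s;
}

// Hook that asks the wrapper to skip the wrapped operation.
static struct demo_state *demo_skipping_hook(int tag, struct demo_state *s)
{
    (void)tag;
    if (DEMO_PRE == s->state) {
        s->state = DEMO_SKIP;
    }
    return s;
}

// Wrapped operation that only counts its invocations.
static int demo_op(int tag)
{
    (void)tag;
    demo_op_calls++;
    return 0;
}
```

Calling demo_wrap(demo_counting_hook, demo_op, 7) runs the hook twice around
one call to demo_op; with demo_skipping_hook the wrapped operation is not run.
*/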

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_enable_fn_t)
     (bool enable, ompi_crcp_base_pml_state_t* );

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_add_comm_fn_t)
     ( struct ompi_communicator_t* comm , ompi_crcp_base_pml_state_t*);
typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_del_comm_fn_t)
     ( struct ompi_communicator_t* comm , ompi_crcp_base_pml_state_t*);

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_add_procs_fn_t)
     ( struct ompi_proc_t **procs, size_t nprocs , ompi_crcp_base_pml_state_t*);
typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_del_procs_fn_t)
     ( struct ompi_proc_t **procs, size_t nprocs , ompi_crcp_base_pml_state_t*);

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_progress_fn_t)
     (ompi_crcp_base_pml_state_t*);

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_iprobe_fn_t)
     (int dst, int tag, struct ompi_communicator_t* comm, int *matched,
      ompi_status_public_t* status, ompi_crcp_base_pml_state_t* );

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_probe_fn_t)
     ( int dst, int tag, struct ompi_communicator_t* comm,
       ompi_status_public_t* status, ompi_crcp_base_pml_state_t* );

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_isend_init_fn_t)
     ( void *buf, size_t count, ompi_datatype_t *datatype, int dst, int tag,
       mca_pml_base_send_mode_t mode, struct ompi_communicator_t* comm,
       struct ompi_request_t **request, ompi_crcp_base_pml_state_t* );

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_isend_fn_t)
     ( void *buf, size_t count, ompi_datatype_t *datatype, int dst, int tag,
       mca_pml_base_send_mode_t mode, struct ompi_communicator_t* comm,
       struct ompi_request_t **request, ompi_crcp_base_pml_state_t* );

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_send_fn_t)
     ( void *buf, size_t count, ompi_datatype_t *datatype, int dst, int tag,
       mca_pml_base_send_mode_t mode, struct ompi_communicator_t* comm,
       ompi_crcp_base_pml_state_t* );

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_irecv_init_fn_t)
     ( void *buf, size_t count, ompi_datatype_t *datatype, int src, int tag,
       struct ompi_communicator_t* comm, struct ompi_request_t **request,
       ompi_crcp_base_pml_state_t*);

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_irecv_fn_t)
     ( void *buf, size_t count, ompi_datatype_t *datatype, int src, int tag,
       struct ompi_communicator_t* comm, struct ompi_request_t **request,
       ompi_crcp_base_pml_state_t* );

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_recv_fn_t)
     ( void *buf, size_t count, ompi_datatype_t *datatype, int src, int tag,
       struct ompi_communicator_t* comm, ompi_status_public_t* status,
       ompi_crcp_base_pml_state_t*);

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_dump_fn_t)
     ( struct ompi_communicator_t* comm, int verbose, ompi_crcp_base_pml_state_t* );

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_start_fn_t)
     ( size_t count, ompi_request_t** requests, ompi_crcp_base_pml_state_t* );

typedef ompi_crcp_base_pml_state_t* (*ompi_crcp_base_pml_ft_event_fn_t)
     (int state, ompi_crcp_base_pml_state_t*);

/* Request Interface */
typedef int (*ompi_crcp_base_request_complete_fn_t)
     (struct ompi_request_t *request);

/************************
 * BTL Wrapper hooks
 * JJH: Wrapper BTL not currently implemented.
 ************************/
/**
 * To allow us to work before and after a BTL command
 */
enum ompi_crcp_base_btl_states_t {
    OMPI_CRCP_BTL_PRE,
    OMPI_CRCP_BTL_POST,
    OMPI_CRCP_BTL_SKIP,
    OMPI_CRCP_BTL_DONE
};
typedef enum ompi_crcp_base_btl_states_t ompi_crcp_base_btl_states_t;

struct ompi_crcp_base_btl_state_t {
    opal_free_list_item_t super;
    ompi_crcp_base_btl_states_t state;
    int error_code;
    mca_btl_base_descriptor_t* des;
    mca_btl_base_component_t *wrapped_btl_component;
    mca_btl_base_module_t *wrapped_btl_module;
};
typedef struct ompi_crcp_base_btl_state_t ompi_crcp_base_btl_state_t;
OBJ_CLASS_DECLARATION(ompi_crcp_base_btl_state_t);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_add_procs_fn_t)
     ( struct mca_btl_base_module_t* btl,
       size_t nprocs,
       struct ompi_proc_t** procs,
       struct mca_btl_base_endpoint_t** endpoints,
       struct opal_bitmap_t* reachable,
       ompi_crcp_base_btl_state_t* );

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_del_procs_fn_t)
     ( struct mca_btl_base_module_t* btl,
       size_t nprocs,
       struct ompi_proc_t** procs,
       struct mca_btl_base_endpoint_t**,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_register_fn_t)
     ( struct mca_btl_base_module_t* btl,
       mca_btl_base_tag_t tag,
       mca_btl_base_module_recv_cb_fn_t cbfunc,
       void* cbdata,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_finalize_fn_t)
     ( struct mca_btl_base_module_t* btl,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_alloc_fn_t)
     ( struct mca_btl_base_module_t* btl,
       size_t size,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_free_fn_t)
     ( struct mca_btl_base_module_t* btl,
       mca_btl_base_descriptor_t* descriptor,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_prepare_fn_t)
     ( struct mca_btl_base_module_t* btl,
       struct mca_btl_base_endpoint_t* endpoint,
       mca_mpool_base_registration_t* registration,
       struct opal_convertor_t* convertor,
       size_t reserve,
       size_t* size,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_send_fn_t)
     ( struct mca_btl_base_module_t* btl,
       struct mca_btl_base_endpoint_t* endpoint,
       struct mca_btl_base_descriptor_t* descriptor,
       mca_btl_base_tag_t tag,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_put_fn_t)
     ( struct mca_btl_base_module_t* btl,
       struct mca_btl_base_endpoint_t* endpoint,
       struct mca_btl_base_descriptor_t* descriptor,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_get_fn_t)
     ( struct mca_btl_base_module_t* btl,
       struct mca_btl_base_endpoint_t* endpoint,
       struct mca_btl_base_descriptor_t* descriptor,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_dump_fn_t)
     ( struct mca_btl_base_module_t* btl,
       struct mca_btl_base_endpoint_t* endpoint,
       int verbose,
       ompi_crcp_base_btl_state_t*);

typedef ompi_crcp_base_btl_state_t* (*mca_crcp_base_btl_module_ft_event_fn_t)
     (int state,
      ompi_crcp_base_btl_state_t*);

/**
 * Structure for CRCP components.
 */
struct ompi_crcp_base_component_2_0_0_t {
    /** MCA base component */
    mca_base_component_t base_version;
    /** MCA base data */
    mca_base_component_data_t base_data;

    /** Verbosity Level */
    int verbose;
    /** Output Handle for opal_output */
    int output_handle;
    /** Default Priority */
    int priority;

};
typedef struct ompi_crcp_base_component_2_0_0_t ompi_crcp_base_component_2_0_0_t;
typedef struct ompi_crcp_base_component_2_0_0_t ompi_crcp_base_component_t;

/**
 * Structure for CRCP modules
 */
struct ompi_crcp_base_module_1_0_0_t {
    /** Initialization Function */
    ompi_crcp_base_module_init_fn_t           crcp_init;
    /** Finalization Function */
    ompi_crcp_base_module_finalize_fn_t       crcp_finalize;
/**< MPI_Quiesce Interface Functions ******************/
|
|
|
|
ompi_crcp_base_quiesce_start_fn_t quiesce_start;
|
|
|
|
ompi_crcp_base_quiesce_end_fn_t quiesce_end;
|
|
|
|
|
2007-03-16 23:11:45 +00:00
|
|
|
/**< PML Wrapper Functions ****************************/
|
|
|
|
ompi_crcp_base_pml_enable_fn_t pml_enable;
|
|
|
|
|
|
|
|
ompi_crcp_base_pml_add_comm_fn_t pml_add_comm;
|
|
|
|
ompi_crcp_base_pml_del_comm_fn_t pml_del_comm;
|
|
|
|
|
|
|
|
ompi_crcp_base_pml_add_procs_fn_t pml_add_procs;
|
|
|
|
ompi_crcp_base_pml_del_procs_fn_t pml_del_procs;
|
|
|
|
|
|
|
|
ompi_crcp_base_pml_progress_fn_t pml_progress;
|
|
|
|
|
|
|
|
ompi_crcp_base_pml_iprobe_fn_t pml_iprobe;
|
|
|
|
ompi_crcp_base_pml_probe_fn_t pml_probe;
|
|
|
|
|
|
|
|
ompi_crcp_base_pml_isend_init_fn_t pml_isend_init;
|
|
|
|
ompi_crcp_base_pml_isend_fn_t pml_isend;
|
|
|
|
ompi_crcp_base_pml_send_fn_t pml_send;
|
|
|
|
|
|
|
|
ompi_crcp_base_pml_irecv_init_fn_t pml_irecv_init;
|
|
|
|
ompi_crcp_base_pml_irecv_fn_t pml_irecv;
|
|
|
|
ompi_crcp_base_pml_recv_fn_t pml_recv;
|
|
|
|
|
|
|
|
ompi_crcp_base_pml_dump_fn_t pml_dump;
|
|
|
|
ompi_crcp_base_pml_start_fn_t pml_start;
|
|
|
|
|
|
|
|
ompi_crcp_base_pml_ft_event_fn_t pml_ft_event;
|
|
|
|
|
|
|
|
/**< Request complete Function ****************************/
|
|
|
|
ompi_crcp_base_request_complete_fn_t request_complete;
|
|
|
|
|
|
|
|
/**< BTL Wrapper Functions ****************************/
|
|
|
|
mca_crcp_base_btl_module_add_procs_fn_t btl_add_procs;
|
|
|
|
mca_crcp_base_btl_module_del_procs_fn_t btl_del_procs;
|
|
|
|
|
|
|
|
mca_crcp_base_btl_module_register_fn_t btl_register;
|
|
|
|
mca_crcp_base_btl_module_finalize_fn_t btl_finalize;
|
|
|
|
|
|
|
|
mca_crcp_base_btl_module_alloc_fn_t btl_alloc;
|
|
|
|
mca_crcp_base_btl_module_free_fn_t btl_free;
|
|
|
|
|
|
|
|
mca_crcp_base_btl_module_prepare_fn_t btl_prepare_src;
|
|
|
|
mca_crcp_base_btl_module_prepare_fn_t btl_prepare_dst;
|
|
|
|
|
|
|
|
mca_crcp_base_btl_module_send_fn_t btl_send;
|
|
|
|
mca_crcp_base_btl_module_put_fn_t btl_put;
|
|
|
|
mca_crcp_base_btl_module_get_fn_t btl_get;
|
|
|
|
|
|
|
|
mca_crcp_base_btl_module_dump_fn_t btl_dump;
|
|
|
|
|
|
|
|
mca_crcp_base_btl_module_ft_event_fn_t btl_ft_event;
|
|
|
|
};
|
|
|
|
typedef struct ompi_crcp_base_module_1_0_0_t ompi_crcp_base_module_1_0_0_t;
|
|
|
|
typedef struct ompi_crcp_base_module_1_0_0_t ompi_crcp_base_module_t;
|
|
|
|
|
|
|
|
OMPI_DECLSPEC extern ompi_crcp_base_module_t ompi_crcp;
|
|
|
|
|
|
|
|
/**
|
2008-07-28 22:40:57 +00:00
|
|
|
* Macro for use in components that are of type CRCP
|
2007-03-16 23:11:45 +00:00
|
|
|
*/
|
2008-07-28 22:40:57 +00:00
|
|
|
#define OMPI_CRCP_BASE_VERSION_2_0_0 \
|
|
|
|
MCA_BASE_VERSION_2_0_0, \
|
|
|
|
"crcp", 2, 0, 0
|
2007-03-16 23:11:45 +00:00
|
|
|
|
|
|
|
/**
|
|
|
|
* Macro to call the CRCP Request Complete function
|
|
|
|
*/
|
2010-03-12 23:57:50 +00:00
|
|
|
#if OPAL_ENABLE_FT_CR == 1
|
2007-03-16 23:11:45 +00:00
|
|
|
#define OMPI_CRCP_REQUEST_COMPLETE(req) \
|
|
|
|
if( NULL != ompi_crcp.request_complete) { \
|
|
|
|
ompi_crcp.request_complete(req); \
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
#define OMPI_CRCP_REQUEST_COMPLETE(req) ;
|
|
|
|
#endif
|
|
|
|
|
2009-08-20 11:42:18 +00:00
|
|
|
END_C_DECLS
|
2007-03-20 14:12:13 +00:00
|
|
|
|
2007-03-16 23:11:45 +00:00
|
|
|
#endif /* MCA_CRCP_H */
|
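The request-complete macro above dispatches through an optional function pointer in the module table and compiles to an empty statement when C/R support is disabled. The following is a minimal, self-contained sketch of that pattern, not the Open MPI implementation. Every `demo_`/`DEMO_` name is hypothetical and stands in for the real types. The `do { ... } while (0)` wrapper is an assumption added here so the macro behaves as a single statement inside an unbraced `if`/`else`; the header itself uses a bare `if` block.

```c
#include <stddef.h>

/* Hypothetical request type; stands in for the real request object. */
typedef struct demo_request { int id; } demo_request_t;

/* Hypothetical signature for the optional completion hook. */
typedef int (*demo_request_complete_fn_t)(demo_request_t *req);

/* Hypothetical module table with a single optional hook (may be NULL). */
typedef struct demo_module {
    demo_request_complete_fn_t request_complete;
} demo_module_t;

static demo_module_t demo_module = { NULL };

/* Stand-in for a build-time feature switch such as OPAL_ENABLE_FT_CR. */
#define DEMO_FT_ENABLED 1

#if DEMO_FT_ENABLED == 1
/* Call the hook only if a component installed one. */
#define DEMO_REQUEST_COMPLETE(req)                      \
    do {                                                \
        if (NULL != demo_module.request_complete) {     \
            demo_module.request_complete(req);          \
        }                                               \
    } while (0)
#else
/* Feature disabled: expand to an empty statement. */
#define DEMO_REQUEST_COMPLETE(req) do { } while (0)
#endif

static int demo_completed = 0;

/* Example hook that just counts how often it was invoked. */
static int demo_count_completion(demo_request_t *req)
{
    (void) req;
    demo_completed++;
    return 0;
}

/* Exercises the macro with and without a hook installed and
 * returns the number of times the hook actually ran. */
static int demo_run(void)
{
    demo_request_t req = { 42 };

    demo_completed = 0;
    demo_module.request_complete = NULL;
    DEMO_REQUEST_COMPLETE(&req);            /* no hook installed: no-op */

    demo_module.request_complete = demo_count_completion;
    if (req.id > 0)
        DEMO_REQUEST_COMPLETE(&req);        /* safe in an unbraced if/else */
    else
        demo_completed = -1;

    return demo_completed;
}
```

Calling `demo_run()` invokes the hook only once it has been installed, so callers never need their own NULL check.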