/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
 *                         University Research and Technology
 *                         Corporation.  All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation.  All rights
 *                         reserved.
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
 *                         University of Stuttgart.  All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
 * Copyright (c) 2007-2012 Cisco Systems, Inc.  All rights reserved.
 * Copyright (c) 2007      Sun Microsystems, Inc.  All rights reserved.
 * Copyright (c) 2009      Oak Ridge National Labs.  All rights reserved.
 * Copyright (c) 2010-2013 Los Alamos National Security, LLC.
 *                         All rights reserved.
 * Copyright (c) 2013-2014 Intel, Inc.  All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */

/** @file **/

#include "opal_config.h"

#include "opal/util/malloc.h"
#include "opal/util/arch.h"
#include "opal/util/output.h"
#include "opal/util/show_help.h"
#include "opal/util/proc.h"
#include "opal/memoryhooks/memory.h"
#include "opal/mca/base/base.h"
#include "opal/runtime/opal.h"
#include "opal/util/net.h"
#include "opal/datatype/opal_datatype.h"
#include "opal/mca/installdirs/base/base.h"
#include "opal/mca/memory/base/base.h"
#include "opal/mca/memcpy/base/base.h"
#include "opal/mca/hwloc/base/base.h"
#include "opal/mca/sec/base/base.h"
#include "opal/mca/timer/base/base.h"
#include "opal/mca/memchecker/base/base.h"
#include "opal/dss/dss.h"
#include "opal/mca/shmem/base/base.h"
#if OPAL_ENABLE_FT_CR == 1
#include "opal/mca/compress/base/base.h"
#endif
#include "opal/threads/threads.h"

#include "opal/runtime/opal_cr.h"
#include "opal/mca/crs/base/base.h"

#include "opal/runtime/opal_progress.h"
#include "opal/mca/event/base/base.h"
#include "opal/mca/backtrace/base/base.h"

#include "opal/constants.h"
#include "opal/util/error.h"
#include "opal/util/stacktrace.h"
#include "opal/util/keyval_parse.h"
#include "opal/util/sys_limits.h"

#if OPAL_CC_USE_PRAGMA_IDENT
#pragma ident OPAL_IDENT_STRING
#elif OPAL_CC_USE_IDENT
#ident OPAL_IDENT_STRING
#endif
const char opal_version_string[] = OPAL_IDENT_STRING;

int opal_initialized = 0;
int opal_util_initialized = 0;

/* We have to put a guess in here in case hwloc is not available.  If
   hwloc is available, this value will be overwritten when the
   hwloc data is loaded. */
int opal_cache_line_size = 128;

bool opal_warn_on_fork;
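
/* Map an OPAL error code to a human-readable message.  This converter is
 * registered with opal_error_register() in opal_init_util() below, so that
 * generic error reporting (e.g., opal_strerror()) can resolve OPAL-level
 * codes in the [OPAL_ERR_BASE, OPAL_ERR_MAX] range. */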
static int
opal_err2str(int errnum, const char **errmsg)
{
    const char *retval;

    switch (errnum) {
    case OPAL_SUCCESS:
        retval = "Success";
        break;
    case OPAL_ERROR:
        retval = "Error";
        break;
    case OPAL_ERR_OUT_OF_RESOURCE:
        retval = "Out of resource";
        break;
    case OPAL_ERR_TEMP_OUT_OF_RESOURCE:
        retval = "Temporarily out of resource";
        break;
    case OPAL_ERR_RESOURCE_BUSY:
        retval = "Resource busy";
        break;
    case OPAL_ERR_BAD_PARAM:
        retval = "Bad parameter";
        break;
    case OPAL_ERR_FATAL:
        retval = "Fatal";
        break;
    case OPAL_ERR_NOT_IMPLEMENTED:
        retval = "Not implemented";
        break;
    case OPAL_ERR_NOT_SUPPORTED:
        retval = "Not supported";
        break;
    case OPAL_ERR_INTERUPTED:   /* sic: constant name keeps its historical spelling */
        retval = "Interrupted";
        break;
    case OPAL_ERR_WOULD_BLOCK:
        retval = "Would block";
        break;
    case OPAL_ERR_IN_ERRNO:
        retval = "In errno";
        break;
    case OPAL_ERR_UNREACH:
        retval = "Unreachable";
        break;
    case OPAL_ERR_NOT_FOUND:
        retval = "Not found";
        break;
    case OPAL_EXISTS:
        retval = "Exists";
        break;
    case OPAL_ERR_TIMEOUT:
        retval = "Timeout";
        break;
    case OPAL_ERR_NOT_AVAILABLE:
        retval = "Not available";
        break;
    case OPAL_ERR_PERM:
        retval = "No permission";
        break;
    case OPAL_ERR_VALUE_OUT_OF_BOUNDS:
        retval = "Value out of bounds";
        break;
    case OPAL_ERR_FILE_READ_FAILURE:
        retval = "File read failure";
        break;
    case OPAL_ERR_FILE_WRITE_FAILURE:
        retval = "File write failure";
        break;
    case OPAL_ERR_FILE_OPEN_FAILURE:
        retval = "File open failure";
        break;
    case OPAL_ERR_PACK_MISMATCH:
        retval = "Pack data mismatch";
        break;
    case OPAL_ERR_PACK_FAILURE:
        retval = "Data pack failed";
        break;
    case OPAL_ERR_UNPACK_FAILURE:
        retval = "Data unpack failed";
        break;
    case OPAL_ERR_UNPACK_INADEQUATE_SPACE:
        retval = "Data unpack had inadequate space";
        break;
    case OPAL_ERR_UNPACK_READ_PAST_END_OF_BUFFER:
        retval = "Data unpack would read past end of buffer";
        break;
    case OPAL_ERR_OPERATION_UNSUPPORTED:
        retval = "Requested operation is not supported on referenced data type";
        break;
    case OPAL_ERR_UNKNOWN_DATA_TYPE:
        retval = "Unknown data type";
        break;
    case OPAL_ERR_BUFFER:
        retval = "Buffer type (described vs non-described) mismatch - operation not allowed";
        break;
    case OPAL_ERR_DATA_TYPE_REDEF:
        retval = "Attempt to redefine an existing data type";
        break;
    case OPAL_ERR_DATA_OVERWRITE_ATTEMPT:
        retval = "Attempt to overwrite a data value";
        break;
    case OPAL_ERR_MODULE_NOT_FOUND:
        retval = "Framework requires at least one active module, but none found";
        break;
    case OPAL_ERR_TOPO_SLOT_LIST_NOT_SUPPORTED:
        retval = "OS topology does not support slot_list process affinity";
        break;
    case OPAL_ERR_TOPO_SOCKET_NOT_SUPPORTED:
        retval = "Could not obtain socket topology information";
        break;
    case OPAL_ERR_TOPO_CORE_NOT_SUPPORTED:
        retval = "Could not obtain core topology information";
        break;
    case OPAL_ERR_NOT_ENOUGH_SOCKETS:
        retval = "Not enough sockets to meet request";
        break;
    case OPAL_ERR_NOT_ENOUGH_CORES:
        retval = "Not enough cores to meet request";
        break;
    case OPAL_ERR_INVALID_PHYS_CPU:
        retval = "Invalid physical cpu number returned";
        break;
    case OPAL_ERR_MULTIPLE_AFFINITIES:
        retval = "Multiple methods for assigning process affinity were specified";
        break;
    case OPAL_ERR_SLOT_LIST_RANGE:
        retval = "Provided slot_list range is invalid";
        break;
    case OPAL_ERR_NETWORK_NOT_PARSEABLE:
        retval = "Provided network specification is not parseable";
        break;
    case OPAL_ERR_SILENT:
        retval = NULL;
        break;
    case OPAL_ERR_NOT_INITIALIZED:
        retval = "Not initialized";
        break;
    case OPAL_ERR_NOT_BOUND:
        retval = "Not bound";
        break;
    case OPAL_ERR_TAKE_NEXT_OPTION:
        retval = "Take next option";
        break;
    case OPAL_ERR_PROC_ENTRY_NOT_FOUND:
        retval = "Database entry not found";
        break;
    case OPAL_ERR_DATA_VALUE_NOT_FOUND:
        retval = "Data for specified key not found";
        break;
    case OPAL_ERR_CONNECTION_FAILED:
        retval = "Connection failed";
        break;
    case OPAL_ERR_AUTHENTICATION_FAILED:
        retval = "Authentication failed";
        break;
    case OPAL_ERR_COMM_FAILURE:
        retval = "Comm failure";
        break;
    case OPAL_ERR_SERVER_NOT_AVAIL:
        retval = "Server not available";
        break;
    default:
        retval = NULL;
    }

    *errmsg = retval;
    return OPAL_SUCCESS;
}
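
/* Initialize only the "utility" level of OPAL: the memory allocator, the
 * output and help-message subsystems, the MCA variable system, and the
 * arch/datatype/DSS engines.  As the body below shows, no MCA framework
 * other than installdirs is opened here, which is what makes this entry
 * point suitable for lightweight tools that do not need the full runtime. */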
int
opal_init_util(int* pargc, char*** pargv)
{
    int ret;
    char *error = NULL;

    /* reference-count the init/finalize pairing; only the first
       caller performs the actual initialization */
    if( ++opal_util_initialized != 1 ) {
        if( opal_util_initialized < 1 ) {
            return OPAL_ERROR;
        }
        return OPAL_SUCCESS;
    }

    /* initialize the memory allocator */
    opal_malloc_init();

    /* initialize the output system */
    opal_output_init();

    /* initialize install dirs code */
    if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_installdirs_base_framework, 0))) {
        fprintf(stderr, "opal_installdirs_base_open() failed -- process will likely abort (%s:%d, returned %d instead of OPAL_SUCCESS)\n",
                __FILE__, __LINE__, ret);
        return ret;
    }

    /* initialize the help system */
    opal_show_help_init();

    /* register handler for errnum -> string conversion */
    if (OPAL_SUCCESS !=
        (ret = opal_error_register("OPAL",
                                   OPAL_ERR_BASE, OPAL_ERR_MAX, opal_err2str))) {
        error = "opal_error_register";
        goto return_error;
    }

    /* keyval lex-based parser */
    if (OPAL_SUCCESS != (ret = opal_util_keyval_parse_init())) {
        error = "opal_util_keyval_parse_init";
        goto return_error;
    }

    if (OPAL_SUCCESS != (ret = opal_net_init())) {
        error = "opal_net_init";
        goto return_error;
    }

    /* Setup the parameter system */
    if (OPAL_SUCCESS != (ret = mca_base_var_init())) {
        error = "mca_base_var_init";
        goto return_error;
    }

    /* register params for opal */
    if (OPAL_SUCCESS != (ret = opal_register_params())) {
        error = "opal_register_params";
        goto return_error;
    }

    /* pretty-print stack handlers */
    if (OPAL_SUCCESS != (ret = opal_util_register_stackhandlers())) {
        error = "opal_util_register_stackhandlers";
        goto return_error;
    }

    /* set system resource limits - internally protected against
     * doing so twice in cases where the launch agent did it for us
     */
    if (OPAL_SUCCESS != (ret = opal_util_init_sys_limits(&error))) {
        opal_show_help("help-opal-runtime.txt",
                       "opal_init:syslimit", false,
                       error);
        return OPAL_ERR_SILENT;
    }

    /* initialize the arch string */
    if (OPAL_SUCCESS != (ret = opal_arch_init())) {
        error = "opal_arch_init";
        goto return_error;
    }

    /* initialize the datatype engine */
    if (OPAL_SUCCESS != (ret = opal_datatype_init())) {
        error = "opal_datatype_init";
        goto return_error;
    }

    /* Initialize the data storage service. */
    if (OPAL_SUCCESS != (ret = opal_dss_open())) {
        error = "opal_dss_open";
        goto return_error;
    }

    return OPAL_SUCCESS;

 return_error:
    if (OPAL_ERR_SILENT != ret) {
        opal_show_help( "help-opal-runtime.txt",
                        "opal_init:startup:internal-failure", true,
                        error, ret );
    }
    return ret;
}
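
/* Initialize the full OPAL layer: run opal_init_util() first, then open
 * the core MCA frameworks (hwloc, memcpy, memory, memchecker, backtrace,
 * timer, event, ...).  A minimal embedding sketch (error handling and
 * argument plumbing elided):
 *
 *     if (OPAL_SUCCESS != opal_init(&argc, &argv)) { abort(); }
 *     ... use OPAL services ...
 *     opal_finalize();
 */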
|
|
|
|
|
|
|
|
|
|
|
|
int
|
2009-12-04 03:51:15 +03:00
|
|
|
opal_init(int* pargc, char*** pargv)
|
2006-01-16 04:48:03 +03:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
char *error = NULL;
|
|
|
|
|
2011-07-12 21:07:41 +04:00
|
|
|
if( ++opal_initialized != 1 ) {
|
|
|
|
if( opal_initialized < 1 ) {
|
|
|
|
return OPAL_ERROR;
|
|
|
|
}
|
2007-06-01 06:43:46 +04:00
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
|
|
|
|
|
2006-01-16 04:48:03 +03:00
|
|
|
/* initialize util code */
|
2009-12-04 03:51:15 +03:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_init_util(pargc, pargv))) {
|
2006-01-16 04:48:03 +03:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* initialize the mca */
|
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_open())) {
|
|
|
|
error = "mca_base_open";
|
|
|
|
goto return_error;
|
|
|
|
}
|
|
|
|
|
2011-09-11 23:02:24 +04:00
|
|
|
/* open hwloc - since this is a static framework, no
|
|
|
|
* select is required
|
|
|
|
*/
|
2013-03-28 01:11:47 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_hwloc_base_framework, 0))) {
|
2011-11-02 22:24:19 +04:00
|
|
|
error = "opal_hwloc_base_open";
|
2011-09-11 23:02:24 +04:00
|
|
|
goto return_error;
|
|
|
|
}
|
|
|
|
|
2006-04-05 09:57:51 +04:00
|
|
|
/* the memcpy component should be one of the first who get
|
2013-03-28 01:11:47 +04:00
|
|
|
* loaded in order to make sure we have all the available
|
2006-04-05 09:57:51 +04:00
|
|
|
* versions of memcpy correctly configured.
|
|
|
|
*/
|
2013-03-28 01:11:47 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_memcpy_base_framework, 0))) {
|
2006-04-05 09:57:51 +04:00
|
|
|
error = "opal_memcpy_base_open";
|
|
|
|
goto return_error;
|
|
|
|
}
|
|
|
|
|
2005-08-14 21:23:34 +04:00
|
|
|
/* open the memory manager components. Memory hooks may be
|
|
|
|
triggered before this (any time after mem_free_init(),
|
|
|
|
actually). This is a hook available for memory manager hooks
|
|
|
|
without good initialization routine support */
|
2013-03-28 01:11:47 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_memory_base_framework, 0))) {
|
2005-10-05 17:56:35 +04:00
|
|
|
error = "opal_memory_base_open";
|
2005-11-27 00:18:47 +03:00
|
|
|
goto return_error;
|
2005-10-05 17:56:35 +04:00
|
|
|
}
|
2005-08-14 21:23:34 +04:00
|
|
|
|
2005-09-27 00:20:20 +04:00
|
|
|
/* initialize the memory manager / tracker */
|
2008-05-19 15:57:44 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_mem_hooks_init())) {
|
2006-12-03 16:59:23 +03:00
|
|
|
error = "opal_mem_hooks_init";
|
2005-11-27 00:18:47 +03:00
|
|
|
goto return_error;
|
2005-10-05 17:56:35 +04:00
|
|
|
}
|
2005-09-27 00:20:20 +04:00
|
|
|
|
2008-02-12 11:46:27 +03:00
|
|
|
/* initialize the memory checker, to allow early support for annotation */
|
2013-03-28 01:11:47 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_memchecker_base_framework, 0))) {
|
2008-02-12 11:46:27 +03:00
|
|
|
error = "opal_memchecker_base_open";
|
|
|
|
goto return_error;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* select the memory checker */
|
2008-05-19 15:57:44 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_memchecker_base_select())) {
|
2008-02-12 11:46:27 +03:00
|
|
|
error = "opal_memchecker_base_select";
|
|
|
|
goto return_error;
|
|
|
|
}
    /* open the backtrace framework so we can emit stack traces on error */
    if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_backtrace_base_framework, 0))) {
        error = "opal_backtrace_base_open";
        goto return_error;
    }

    /* open the high-resolution timer framework */
    if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_timer_base_framework, 0))) {
        error = "opal_timer_base_open";
        goto return_error;
    }

    /*
     * Need to start the event and progress engines if no one else has.
     * opal_cr_init uses the progress engine, so it is lumped together
     * into this set as well.
     */

    /*
     * Initialize the event library
     */
    if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_event_base_framework, 0))) {
        error = "opal_event_base_open";
        goto return_error;
    }

    /*
     * Initialize the general progress engine
     */
    if (OPAL_SUCCESS != (ret = opal_progress_init())) {
        error = "opal_progress_init";
        goto return_error;
    }
    /* we want to tick the event library whenever possible */
    opal_progress_event_users_increment();

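    /* opal_progress_event_users_increment() bumps a counter that tells
       opal_progress() to poll the event library on every iteration;
       a later matching decrement restores the default, less frequent
       polling. */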
    /* setup the shmem framework */
    if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_shmem_base_framework, 0))) {
        error = "opal_shmem_base_open";
        goto return_error;
    }

    if (OPAL_SUCCESS != (ret = opal_shmem_base_select())) {
        error = "opal_shmem_base_select";
        goto return_error;
    }

#if OPAL_ENABLE_FT_CR == 1
    /*
     * Initialize the compression framework
     * Note: Currently only used in C/R so it has been marked to only
     *       initialize when C/R is enabled. If other places in the code
     *       wish to use this framework, it is safe to remove the protection.
     */
    if( OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_compress_base_framework, 0)) ) {
        error = "opal_compress_base_open";
        goto return_error;
    }

    if( OPAL_SUCCESS != (ret = opal_compress_base_select()) ) {
        error = "opal_compress_base_select";
        goto return_error;
    }
#endif

    /*
     * Initialize the checkpoint/restart functionality.
     * Note: Always do this so we can detect if the user attempts to
     *       checkpoint a non-checkpointable job; otherwise the tools
     *       may hang or fail to clean up properly.
     */
    if (OPAL_SUCCESS != (ret = opal_cr_init())) {
        error = "opal_cr_init";
        goto return_error;
    }

    /* initialize the security framework */
    if( OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_sec_base_framework, 0)) ) {
        error = "opal_sec_base_open";
        goto return_error;
    }
    if( OPAL_SUCCESS != (ret = opal_sec_base_select()) ) {
        error = "opal_sec_base_select";
        goto return_error;
    }

    return OPAL_SUCCESS;

 return_error:
    opal_show_help( "help-opal-runtime.txt",
                    "opal_init:startup:internal-failure", true,
                    error, ret );
    return ret;
}
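
/*
 * Typical usage is an init/finalize pair around whatever OPAL services
 * the caller needs (a minimal sketch; the real callers are the
 * ORTE/OMPI initialization paths, not user code):
 *
 *     int ret = opal_init(&argc, &argv);
 *     if (OPAL_SUCCESS != ret) {
 *         return ret;
 *     }
 *     ...use OPAL frameworks (event, timer, shmem, ...)...
 *     opal_finalize();
 *
 * Because of the reference counting above, nested init/finalize pairs
 * are safe as long as they balance.
 */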

/*
 * A slimmed-down initialization path intended for test programs that
 * need the OPAL utility layer without bringing up the full framework
 * stack.
 */
int opal_init_test(void)
{
    int ret;
    char *error;

    /* initialize the memory allocator */
    opal_malloc_init();

    /* initialize the output system */
    opal_output_init();

    /* initialize install dirs code */
    if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_installdirs_base_framework, 0))) {
        fprintf(stderr, "opal_installdirs_base_open() failed -- process will likely abort (%s:%d, returned %d instead of OPAL_SUCCESS)\n",
                __FILE__, __LINE__, ret);
        return ret;
    }

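    /* Note: installdirs must come up before opal_show_help_init(),
       since the help system needs the install directories to locate
       its help text files -- which is also why the failure above is
       reported via fprintf rather than opal_show_help(). */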
    /* initialize the help system */
    opal_show_help_init();

    /* register handler for errnum -> string conversion */
    if (OPAL_SUCCESS !=
        (ret = opal_error_register("OPAL",
                                   OPAL_ERR_BASE, OPAL_ERR_MAX, opal_err2str))) {
        error = "opal_error_register";
        goto return_error;
    }

    /* keyval lex-based parser */
    if (OPAL_SUCCESS != (ret = opal_util_keyval_parse_init())) {
        error = "opal_util_keyval_parse_init";
        goto return_error;
    }

    /* networking utilities */
    if (OPAL_SUCCESS != (ret = opal_net_init())) {
        error = "opal_net_init";
        goto return_error;
    }

    /* Setup the parameter system */
    if (OPAL_SUCCESS != (ret = mca_base_var_init())) {
        error = "mca_base_var_init";
        goto return_error;
    }

    /* register params for opal */
    if (OPAL_SUCCESS != (ret = opal_register_params())) {
        error = "opal_register_params";
        goto return_error;
    }

    /* pretty-print stack handlers */
    if (OPAL_SUCCESS != (ret = opal_util_register_stackhandlers())) {
        error = "opal_util_register_stackhandlers";
        goto return_error;
    }

    /* Initialize the data storage service. */
    if (OPAL_SUCCESS != (ret = opal_dss_open())) {
        error = "opal_dss_open";
        goto return_error;
    }

    /* initialize the mca */
    if (OPAL_SUCCESS != (ret = mca_base_open())) {
        error = "mca_base_open";
        goto return_error;
    }

    /* register (but do not open) the event framework */
    if (OPAL_SUCCESS != (ret = mca_base_framework_register(&opal_event_base_framework, 0))) {
        error = "opal_event_register";
        goto return_error;
    }

    return OPAL_SUCCESS;

 return_error:
    opal_show_help( "help-opal-runtime.txt",
                    "opal_init:startup:internal-failure", true,
                    error, ret );
    return ret;
}
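
/*
 * Fork warning support: opal_warn_fork() arms a pthread_atfork()
 * handler (once) that emits a show_help warning the first time a
 * process forks after OPAL is initialized, since forking while the
 * memory manager and interconnect drivers are active can be unsafe.
 */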
#if OPAL_HAVE_POSIX_THREADS
static bool fork_warning_issued = false;
static bool atfork_called = false;

static void warn_fork_cb(void)
{
    if (opal_initialized && !fork_warning_issued) {
        opal_show_help("help-opal-runtime.txt", "opal_init:warn-fork", true,
                       OPAL_NAME_PRINT(OPAL_PROC_MY_NAME), getpid());
        fork_warning_issued = true;
    }
}
#endif  /* OPAL_HAVE_POSIX_THREADS */

void opal_warn_fork(void)
{
#if OPAL_HAVE_POSIX_THREADS
    if (opal_warn_on_fork && !atfork_called) {
        pthread_atfork(warn_fork_cb, NULL, NULL);
        atfork_called = true;
    }
#endif
}