2013-03-28 01:09:41 +04:00
/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
2005-05-19 17:33:55 +04:00
/*
A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php
Documentation:
http://osl.iu.edu/research/ft/
Major Changes:
--------------
* Added C/R-enabled Debugging support.
Enabled with the --enable-crdebug flag. See the following website for more information:
http://osl.iu.edu/research/ft/crdebug/
* Added Stable Storage (SStore) framework for checkpoint storage
* 'central' component does a direct to central storage save
* 'stage' component stages checkpoints to central storage while the application continues execution.
* 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
* 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
* Added Compression (compress) framework to support
* Add two new ErrMgr recovery policies
* {{{crmig}}} C/R Process Migration
* {{{autor}}} C/R Automatic Recovery
* Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
* Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
* {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
* {{{OMPI_CR_Restart}}}
* {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
* {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
* {{{OMPI_CR_Quiesce_start}}}
* {{{OMPI_CR_Quiesce_checkpoint}}}
* {{{OMPI_CR_Quiesce_end}}}
* {{{OMPI_CR_self_register_checkpoint_callback}}}
* {{{OMPI_CR_self_register_restart_callback}}}
* {{{OMPI_CR_self_register_continue_callback}}}
* The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
* Add a progress meter to:
* FileM rsh (filem_rsh_process_meter)
* SnapC full (snapc_full_progress_meter)
* SStore stage (sstore_stage_progress_meter)
* Added 2 new command line options to ompi-restart
* --showme : Display the full command line that would have been exec'ed.
* --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
* Deprecated some MCA params:
* crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
* snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
* snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
* snapc_base_store_in_place deprecated, replaced with different components of SStore
* snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
* snapc_base_establish_global_snapshot_dir deprecated, never well supported
* snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
Minor Changes:
--------------
* Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
* Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
* Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
* Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
* Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
* opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
* Cleanup the CRS framework and components to work with the SStore framework.
* Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
* Add 'quiesce' hook to CRCP for a future enhancement.
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
* Add optional application level INC callbacks (registered through the CR MPI Ext interface).
* Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
* {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
* {{{opal-restart}}} also supports local decompression before restarting
* {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
* {{{orte-restart}}} now uses the SStore framework to work with the metadata
* Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
* Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
* Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
* Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
* odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
* Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
* Improve the checks for 'already checkpointing' error path.
* Add a recovery output timer, to show how long it takes to restart a job
* Do a better job of cleaning up the old session directory on restart.
* Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment)
* Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.
This commit was SVN r23587.
The following Trac tickets were found above:
Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
2010-08-11 00:51:11 +04:00
 * Copyright (c) 2004-2010 The Trustees of Indiana University and Indiana
2005-11-05 22:57:48 +03:00
 *                         University Research and Technology
 *                         Corporation. All rights reserved.
 * Copyright (c) 2004-2005 The University of Tennessee and The University
 *                         of Tennessee Research Foundation. All rights
 *                         reserved.
2005-09-07 22:52:28 +04:00
 * Copyright (c) 2004-2005 High Performance Computing Center Stuttgart,
2005-05-19 17:33:55 +04:00
 *                         University of Stuttgart. All rights reserved.
 * Copyright (c) 2004-2005 The Regents of the University of California.
 *                         All rights reserved.
PSM/PSM2: Disable signal handler hijacking by default
Per discussion on https://github.com/open-mpi/ompi/pull/1767 (and some
subsequent phone calls and off-issue email discussions), the PSM
library is hijacking signal handlers by default. Specifically: unless
the environment variable `IPATH_NO_BACKTRACE=1` (for PSM / Intel
TrueScale) is set, the library constructor for this library will
hijack various signal handlers for the purpose of invoking its own
error reporting mechanisms.
This may be a bit *surprising*, but is not a *problem*, per se. The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).
This PSM signal hijacking subverts Open MPI's own signal reporting
mechanism, which may be a bit surprising for some users (particularly
those who do not have Intel TrueScale). As such, we disable it by
default so that Open MPI's own error-reporting mechanisms are used.
Additionally, there is a typo in the library destructor for the PSM2
library that may cause problems in the unloading of its signal
handlers. This problem can be avoided by setting `HFI_NO_BACKTRACE=1`
(for PSM2 / Intel OmniPath).
This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present. Having the PSM/PSM2 libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).
This commit sets the following two environment variables to disable
this behavior from the PSM/PSM2 libraries (if they are not already
set):
* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1
If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).
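The "set only if not already set" behavior described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the helper names `set_default_env` and `disable_psm_backtrace_defaults` are hypothetical, and the sketch assumes a POSIX system where `setenv()` is available.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdlib.h>

// Hypothetical helper (not an Open MPI function): set NAME=VALUE only
// when NAME is not already present in the environment.
int set_default_env(const char *name, const char *value)
{
    // overwrite == 0: POSIX setenv() leaves an existing value untouched,
    // so a preference set by the user is honored.
    return setenv(name, value, 0);
}

// Apply the two defaults named in the commit message.
// Returns 0 on success, -1 if either setenv() call fails.
int disable_psm_backtrace_defaults(void)
{
    if (0 != set_default_env("IPATH_NO_BACKTRACE", "1")) {
        return -1;
    }
    if (0 != set_default_env("HFI_NO_BACKTRACE", "1")) {
        return -1;
    }
    return 0;
}
```

Calling `disable_psm_backtrace_defaults()` before the PSM/PSM2 libraries are loaded would leave a user-supplied `IPATH_NO_BACKTRACE=0` in place while filling in `HFI_NO_BACKTRACE=1` if it was unset.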
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-14 17:45:06 +03:00
 * Copyright (c) 2007-2016 Cisco Systems, Inc. All rights reserved.
2007-11-03 05:40:22 +03:00
 * Copyright (c) 2007 Sun Microsystems, Inc. All rights reserved.
- Split the datatype engine into two parts: an MPI specific part in
OMPI
and a language agnostic part in OPAL. The convertor is completely
moved into OPAL. This offers several benefits as described in RFC
http://www.open-mpi.org/community/lists/devel/2009/07/6387.php
namely:
- Fewer basic types (int* and float* types, boolean and wchar)
- Fixing naming scheme to ompi-nomenclature.
- Usability outside of the ompi-layer.
- Due to the fixed nature of simple opal types, their information is
completely
known at compile time and therefore constified
- With fewer datatypes (22), the actual sizes of bit-field types may be
reduced
from 64 to 32 bits, allowing reorganizing the opal_datatype
structure, eliminating holes and keeping data required in convertor
(upon send/recv) in one cacheline...
This has implications to the convertor-datastructure and other parts
of the code.
- Several performance tests have been run, the netpipe latency does not
change with
this patch on Linux/x86-64 on the smoky cluster.
- Extensive tests have been done to verify correctness (no new
regressions) using:
1. mpi_test_suite on linux/x86-64 using clean ompi-trunk and
ompi-ddt:
a. running both trunk and ompi-ddt resulted in no differences
(except for MPI_SHORT_INT and MPI_TYPE_MIX_LB_UB do now run
correctly).
b. with --enable-memchecker and running under valgrind (one buglet
when run with static found in test-suite, committed)
2. ibm testsuite on linux/x86-64 using clean ompi-trunk and ompi-ddt:
all passed (except for the dynamic/ tests, which failed as they do on trunk/MTT)
3. compilation and usage of HDF5 tests on Jaguar using PGI and
PathScale compilers.
4. compilation and usage on Scicortex.
- Please note, that for the heterogeneous case, (-m32 compiled
binaries/ompi), neither
ompi-trunk, nor ompi-ddt branch would successfully launch.
This commit was SVN r21641.
2009-07-13 08:56:31 +04:00
 * Copyright (c) 2009 Oak Ridge National Labs. All rights reserved.
2015-04-08 21:00:13 +03:00
 * Copyright (c) 2010-2015 Los Alamos National Security, LLC.
2011-06-21 19:41:57 +04:00
 *                         All rights reserved.
Move from the use of regex to compression
We've been fighting the battle of trying to create a regex generator and
parser that can handle arbitrary hostname schemes - without long-term
success. The worst of it is that there is no way of checking to see if
the computed regex is correct short of parsing it and doing a
character-by-character comparison with the original string. Ugh...there
has to be a better solution.
One option is to investigate using 3rd-party regex libraries as
those are coming from communities whose sole focus is resolving that
problem. However, someone would need to spend the time to investigate
it, and we'd have to find a license-friendly implementation.
Another option is to quit beating our heads against the wall and just
compress the information. It won't be as much of a reduction, but we
also won't keep hitting scenarios where things break. In this case, it
seems that "perfection" is definitely the enemy of "good enough".
This PR implements the compression option while retaining the
possibility of people adding regex-generating components. The
compression code used in ORTE is consolidated into the opal/compress
framework. That framework currently held bzip and gzip components for
use in compressing checkpoint files - since we no longer support C/R, I
have .opal_ignore'd those components.
However, I have left the original framework APIs alone in case someone
ever decides to redo C/R. The APIs of interest here are added to the
framework - specifically, the "compress_block" and "decompress_block"
functions. I then moved the ORTE zlib compression code into a new
component in this framework.
Unfortunately, the framework currently is a single-select one - i.e.,
only one active component at a time. Since I .opal_ignore'd the other
two and made the priority of zlib high, this isn't a problem. However,
if someone wants to re-enable bzip/gzip or add another component, they
might need to transition opal/compress to a multi-select framework.
Included changes:
* Consolidate the compression code into the opal/compress framework
* Move the ORTE zlib compression code into a new opal/compress/zlib
component
* Ignore the bzip and gzip components in opal/compress framework
* Add a "compress_base_limit" MCA param to set the threshold above which
we compress data - defaults to 4096 bytes
* Delete stale brucks and rcd components from orte/grpcomm framework
* Delete the orte/regx framework
* Update the launch system to use opal/compress instead of string regex
* Provide a default module if no zlib is available
* Fix some misc multi-node issues
* Properly generate the nidmap in response to a "connection warmup"
message so the remote daemon knows the children it needs to launch.
* Remove stale references to orte_node_regex
* opal_byte_object_t's are not OPAL objects - properly release allocated
memory.
* Set the topology
* Currently only handling homogeneous case
* Update the compress framework files to conform
* Consolidate open/close into one "frame" file. Ensure we open/close the
framework
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-01-30 03:02:21 +03:00
 * Copyright (c) 2013-2019 Intel, Inc. All rights reserved.
2016-10-26 09:38:45 +03:00
 * Copyright (c) 2015-2017 Research Organization for Information Science
2014-10-21 14:49:58 +04:00
 *                         and Technology (RIST). All rights reserved.
2017-09-13 21:43:15 +03:00
 * Copyright (c) 2017 Amazon.com, Inc. or its affiliates.
 *                    All Rights reserved.
2018-03-20 07:24:17 +03:00
 * Copyright (c) 2018 Mellanox Technologies, Inc.
 *                    All rights reserved.
2019-01-09 22:38:12 +03:00
 * Copyright (c) 2018-2019 Triad National Security, LLC. All rights
2018-11-29 01:52:52 +03:00
 *                         reserved.
2005-05-19 17:33:55 +04:00
 * $COPYRIGHT$
2005-09-07 22:52:28 +04:00
 *
2005-05-19 17:33:55 +04:00
 * Additional copyrights may follow
2005-09-07 22:52:28 +04:00
 *
2005-05-19 17:33:55 +04:00
 * $HEADER$
 */

/** @file **/

2015-12-24 08:40:33 +03:00
#ifdef HAVE_UNISTD_H
#include <unistd.h>
#endif

2006-02-12 04:33:29 +03:00
#include "opal_config.h"
2005-05-22 22:40:03 +04:00

2005-07-04 05:36:20 +04:00
#include "opal/util/malloc.h"
2013-01-15 05:27:36 +04:00
#include "opal/util/arch.h"
2005-07-04 03:31:27 +04:00
#include "opal/util/output.h"
2005-10-05 17:56:35 +04:00
#include "opal/util/show_help.h"
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 04:47:28 +04:00
#include "opal/util/proc.h"
2005-11-11 03:26:27 +03:00
#include "opal/memoryhooks/memory.h"
2005-08-13 00:46:25 +04:00
#include "opal/mca/base/base.h"
2016-05-07 14:12:01 +03:00
#include "opal/mca/base/mca_base_var.h"
2005-08-13 00:46:25 +04:00
#include "opal/runtime/opal.h"
2007-07-19 00:25:01 +04:00
#include "opal/util/net.h"
2009-07-13 08:56:31 +04:00
#include "opal/datatype/opal_datatype.h"
2007-04-21 04:15:05 +04:00
#include "opal/mca/installdirs/base/base.h"
2005-08-14 21:23:34 +04:00
#include "opal/mca/memory/base/base.h"
2016-03-29 07:35:18 +03:00
#include "opal/mca/patcher/base/base.h"
2006-04-05 09:57:51 +04:00
#include "opal/mca/memcpy/base/base.h"
2011-09-11 23:02:24 +04:00
#include "opal/mca/hwloc/base/base.h"
2017-09-13 21:43:15 +03:00
#include "opal/mca/reachable/base/base.h"
2005-08-18 09:34:22 +04:00
#include "opal/mca/timer/base/base.h"
2008-02-12 11:46:27 +03:00
#include "opal/mca/memchecker/base/base.h"
2016-10-03 10:32:22 +03:00
#include "opal/mca/if/base/base.h"
2008-02-28 04:57:57 +03:00
#include "opal/dss/dss.h"
2011-06-21 19:41:57 +04:00
#include "opal/mca/shmem/base/base.h"
2010-08-11 00:51:11 +04:00
#include "opal/mca/compress/base/base.h"
2014-08-01 22:49:37 +04:00
#include "opal/threads/threads.h"
2018-11-29 01:52:52 +03:00
#include "opal/threads/tsd.h"
2007-03-17 02:11:45 +03:00

#include "opal/runtime/opal_cr.h"
#include "opal/mca/crs/base/base.h"

#include "opal/runtime/opal_progress.h"
Update libevent to the 2.0 series, currently at 2.0.7rc. We will update to their final release when it becomes available. Currently known errors exist in unused portions of the libevent code. This revision passes the IBM test suite on a Linux machine and on a standalone Mac.
This is a fairly intrusive change, but outside of the moving of opal/event to opal/mca/event, the only changes involved (a) changing all calls to opal_event functions to reflect the new framework instead, and (b) ensuring that all opal_event_t objects are properly constructed since they are now true opal_objects.
Note: Shiqing has just returned from vacation and has not yet had a chance to complete the Windows integration. Thus, this commit almost certainly breaks Windows support on the trunk. However, I want this to have a chance to soak for as long as possible before I become less available a week from today (going to be at a class for 5 days, and thus will only be sparingly available) so we can find and fix any problems.
Biggest change is moving the libevent code from opal/event to a new opal/mca/event framework. This was done to make it much easier to update libevent in the future. New versions can be inserted as a new component and tested in parallel with the current version until validated, then we can remove the earlier version if we so choose. This is a statically built framework ala installdirs, so only one component will build at a time. There is no selection logic - the sole compiled component simply loads its function pointers into the opal_event struct.
I have gone thru the code base and converted all the libevent calls I could find. However, I cannot compile nor test every environment. It is therefore quite likely that errors remain in the system. Please keep an eye open for two things:
1. compile-time errors: these will be obvious as calls to the old functions (e.g., opal_evtimer_new) must be replaced by the new framework APIs (e.g., opal_event.evtimer_new)
2. run-time errors: these will likely show up as segfaults due to missing constructors on opal_event_t objects. It appears that it became a typical practice for people to "init" an opal_event_t by simply using memset to zero it out. This will no longer work - you must either OBJ_NEW or OBJ_CONSTRUCT an opal_event_t. I tried to catch these cases, but may have missed some. Believe me, you'll know when you hit it.
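The run-time failure mode in item 2 can be illustrated with a self-contained sketch. Everything here is hypothetical: `demo_event_t`, `demo_event_construct` and `demo_event_use` are stand-ins invented for illustration and are not Open MPI's `opal_event_t`, `OBJ_NEW` or `OBJ_CONSTRUCT`. The sketch only shows why zeroing an object with memset is not equivalent to running its constructor; where the real code would likely segfault, this stand-in returns an error code instead.

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical object whose constructor establishes required state.
typedef struct demo_event {
    bool constructed;  // set only by the constructor
    int  refcount;     // a constructed object starts with one reference
} demo_event_t;

// Stand-in for a constructor: puts the object into a usable state.
void demo_event_construct(demo_event_t *ev)
{
    ev->constructed = true;
    ev->refcount = 1;
}

// Stand-in for any API that requires a constructed object.
// Returns 0 on success, -1 if the object was never constructed.
int demo_event_use(const demo_event_t *ev)
{
    return ev->constructed ? 0 : -1;
}
```

With these definitions, an object prepared only by `memset(&ev, 0, sizeof(ev))` is rejected by `demo_event_use()`, while the same object passes once `demo_event_construct(&ev)` has run.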
There is also the issue of the new libevent "no recursion" behavior. As I described in a recent email, we will have to discuss this and figure out what, if anything, we need to do.
This commit was SVN r23925.
2010-10-24 22:35:54 +04:00
#include "opal/mca/event/base/base.h"
2006-09-26 03:41:06 +04:00
#include "opal/mca/backtrace/base/base.h"
2007-03-17 02:11:45 +03:00

2006-02-12 04:33:29 +03:00
#include "opal/constants.h"
2005-08-22 07:05:39 +04:00
#include "opal/util/error.h"
2006-01-11 07:36:39 +03:00
#include "opal/util/stacktrace.h"
2006-01-16 04:48:03 +03:00
#include "opal/util/keyval_parse.h"
2007-04-23 22:53:47 +04:00
#include "opal/util/sys_limits.h"
2017-04-13 17:45:37 +03:00
#include "opal/util/timings.h"
2005-09-07 22:52:28 +04:00

2009-05-07 00:11:28 +04:00
#if OPAL_CC_USE_PRAGMA_IDENT
2007-11-03 05:40:22 +03:00
#pragma ident OPAL_IDENT_STRING
2009-05-07 00:11:28 +04:00
#elif OPAL_CC_USE_IDENT
2007-11-03 05:40:22 +03:00
#ident OPAL_IDENT_STRING
#endif
2008-05-20 16:13:19 +04:00
const char opal_version_string[] = OPAL_IDENT_STRING;
2007-03-17 02:11:45 +03:00

2011-07-12 21:07:41 +04:00
int opal_initialized = 0;
2015-04-15 00:34:21 +03:00
bool opal_init_called = false;
2011-07-12 21:07:41 +04:00
int opal_util_initialized = 0;
2012-04-24 21:31:06 +04:00
/* We have to put a guess in here in case hwloc is not available. If
   hwloc is available, this value will be overwritten when the
   hwloc data is loaded. */
int opal_cache_line_size = 128;
2015-04-15 09:14:57 +03:00
bool opal_warn_on_fork = true;
2006-08-22 00:07:38 +04:00

2011-02-13 19:09:17 +03:00
static int
opal_err2str(int errnum, const char **errmsg)
2005-08-22 07:05:39 +04:00
{
    const char *retval;
2017-09-21 20:26:41 +03:00

2012-04-06 18:23:13 +04:00
    switch (errnum) {
2005-08-22 07:05:39 +04:00
    case OPAL_SUCCESS:
        retval = "Success";
        break;
    case OPAL_ERROR:
        retval = "Error";
        break;
    case OPAL_ERR_OUT_OF_RESOURCE:
        retval = "Out of resource";
        break;
2005-12-21 09:27:34 +03:00
    case OPAL_ERR_TEMP_OUT_OF_RESOURCE:
        retval = "Temporarily out of resource";
        break;
    case OPAL_ERR_RESOURCE_BUSY:
        retval = "Resource busy";
2005-08-22 07:05:39 +04:00
        break;
    case OPAL_ERR_BAD_PARAM:
        retval = "Bad parameter";
        break;
2005-12-21 09:27:34 +03:00
    case OPAL_ERR_FATAL:
        retval = "Fatal";
        break;
    case OPAL_ERR_NOT_IMPLEMENTED:
        retval = "Not implemented";
        break;
    case OPAL_ERR_NOT_SUPPORTED:
        retval = "Not supported";
        break;
2015-02-28 05:30:43 +03:00
    case OPAL_ERR_INTERRUPTED:
        retval = "Interrupted";
2005-12-21 09:27:34 +03:00
        break;
    case OPAL_ERR_WOULD_BLOCK:
        retval = "Would block";
        break;
    case OPAL_ERR_IN_ERRNO:
        retval = "In errno";
        break;
    case OPAL_ERR_UNREACH:
        retval = "Unreachable";
        break;
    case OPAL_ERR_NOT_FOUND:
        retval = "Not found";
        break;
    case OPAL_EXISTS:
        retval = "Exists";
        break;
    case OPAL_ERR_TIMEOUT:
        retval = "Timeout";
        break;
    case OPAL_ERR_NOT_AVAILABLE:
        retval = "Not available";
        break;
    case OPAL_ERR_PERM:
        retval = "No permission";
        break;
    case OPAL_ERR_VALUE_OUT_OF_BOUNDS:
        retval = "Value out of bounds";
        break;
    case OPAL_ERR_FILE_READ_FAILURE:
        retval = "File read failure";
        break;
    case OPAL_ERR_FILE_WRITE_FAILURE:
        retval = "File write failure";
        break;
    case OPAL_ERR_FILE_OPEN_FAILURE:
        retval = "File open failure";
        break;
2008-02-28 04:57:57 +03:00
    case OPAL_ERR_PACK_MISMATCH:
        retval = "Pack data mismatch";
        break;
    case OPAL_ERR_PACK_FAILURE:
|
|
|
|
retval = "Data pack failed";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_UNPACK_FAILURE:
|
|
|
|
retval = "Data unpack failed";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_UNPACK_INADEQUATE_SPACE:
|
|
|
|
retval = "Data unpack had inadequate space";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_UNPACK_READ_PAST_END_OF_BUFFER:
|
|
|
|
retval = "Data unpack would read past end of buffer";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_OPERATION_UNSUPPORTED:
|
|
|
|
retval = "Requested operation is not supported on referenced data type";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_UNKNOWN_DATA_TYPE:
|
|
|
|
retval = "Unknown data type";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_BUFFER:
|
2009-04-16 20:23:28 +04:00
|
|
|
retval = "Buffer type (described vs non-described) mismatch - operation not allowed";
|
2008-02-28 04:57:57 +03:00
|
|
|
break;
|
|
|
|
case OPAL_ERR_DATA_TYPE_REDEF:
|
|
|
|
retval = "Attempt to redefine an existing data type";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_DATA_OVERWRITE_ATTEMPT:
|
|
|
|
retval = "Attempt to overwrite a data value";
|
|
|
|
break;
|
2010-05-07 00:57:17 +04:00
|
|
|
case OPAL_ERR_MODULE_NOT_FOUND:
|
|
|
|
retval = "Framework requires at least one active module, but none found";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_TOPO_SLOT_LIST_NOT_SUPPORTED:
|
|
|
|
retval = "OS topology does not support slot_list process affinity";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_TOPO_SOCKET_NOT_SUPPORTED:
|
|
|
|
retval = "Could not obtain socket topology information";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_TOPO_CORE_NOT_SUPPORTED:
|
|
|
|
retval = "Could not obtain core topology information";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_NOT_ENOUGH_SOCKETS:
|
|
|
|
retval = "Not enough sockets to meet request";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_NOT_ENOUGH_CORES:
|
|
|
|
retval = "Not enough cores to meet request";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_INVALID_PHYS_CPU:
|
|
|
|
retval = "Invalid physical cpu number returned";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_MULTIPLE_AFFINITIES:
|
|
|
|
retval = "Multiple methods for assigning process affinity were specified";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_SLOT_LIST_RANGE:
|
|
|
|
retval = "Provided slot_list range is invalid";
|
|
|
|
break;
|
2011-06-07 06:09:11 +04:00
|
|
|
case OPAL_ERR_NETWORK_NOT_PARSEABLE:
|
|
|
|
retval = "Provided network specification is not parseable";
|
|
|
|
break;
|
At long last, the fabled revision to the affinity system has arrived. A more detailed explanation of how this all works will be presented here:
https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
The wiki page is incomplete at the moment, but I hope to complete it over the next few days. I will provide updates on the devel list. As the wiki page states, the default and most commonly used options remain unchanged (except as noted below). New, esoteric and complex options have been added, but unless you are a true masochist, you are unlikely to use many of them beyond perhaps an initial curiosity-motivated experimentation.
In a nutshell, this commit revamps the map/rank/bind procedure to take into account topology info on the compute nodes. I have, for the most part, preserved the default behaviors, with three notable exceptions:
1. I have at long last bowed my head in submission to the system admins of managed clusters. For years, they have complained about our default of allowing users to oversubscribe nodes - i.e., to run more processes on a node than allocated slots. Accordingly, I have modified the default behavior: if you are running off of hostfile/dash-host allocated nodes, then the default is to allow oversubscription. If you are running off of RM-allocated nodes, then the default is to NOT allow oversubscription. Flags to override these behaviors are provided, so this only affects the default behavior.
2. both cpus/rank and stride have been removed. The latter was demanded by those who didn't understand the purpose behind it - and I agreed as the users who requested it are no longer using it. The former was removed temporarily pending implementation.
3. vm launch is now the sole method for starting OMPI. It was just too darned hard to maintain multiple launch procedures - maybe someday, provided someone can demonstrate a reason to do so.
As Jeff stated, it is impossible to fully test a change of this size. I have tested it on Linux and Mac, covering all the default and simple options, singletons, and comm_spawn. That said, I'm sure others will find problems, so I'll be watching MTT results until this stabilizes.
This commit was SVN r25476.
2011-11-15 07:40:11 +04:00
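The oversubscription default described in the commit message above can be sketched as a small decision function. This is only an illustration of the stated rule, not the actual ORTE mapper code: the enum names and the function `oversubscribe_allowed` are hypothetical, and the real override flags are not named in the message.

```c
#include <stdbool.h>

/* Hypothetical: where the node allocation came from. */
typedef enum {
    ALLOC_FROM_HOSTFILE, /* hostfile or dash-host on the command line */
    ALLOC_FROM_RM        /* nodes allocated by a resource manager */
} alloc_source_t;

/* Hypothetical tri-state standing in for the user override flags. */
typedef enum {
    OVERRIDE_NONE,
    OVERRIDE_ALLOW,
    OVERRIDE_FORBID
} oversub_override_t;

/* Decide whether a node may run more processes than it has slots.
 * An explicit override always wins; otherwise the default depends
 * only on the allocation source, as the commit message states. */
static bool oversubscribe_allowed(alloc_source_t src, oversub_override_t ov)
{
    if (OVERRIDE_ALLOW == ov) {
        return true;
    }
    if (OVERRIDE_FORBID == ov) {
        return false;
    }
    return ALLOC_FROM_HOSTFILE == src;
}
```

Keeping the override check ahead of the default mirrors the statement that the change "only affects the default behavior".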
|
|
|
case OPAL_ERR_SILENT:
|
|
|
|
retval = NULL;
|
|
|
|
break;
|
2012-02-10 22:29:52 +04:00
|
|
|
case OPAL_ERR_NOT_INITIALIZED:
|
|
|
|
retval = "Not initialized";
|
|
|
|
break;
|
Per RFC, bring in the following changes:
* Remove paffinity, maffinity, and carto frameworks -- they've been
wholly replaced by hwloc.
* Move ompi_mpi_init() affinity-setting/checking code down to ORTE.
* Update sm, smcuda, wv, and openib components to no longer use carto.
Instead, use hwloc data. There are still optimizations possible in
the sm/smcuda BTLs (i.e., making multiple mpools). Also, the old
carto-based code found out how many NUMA nodes were ''available''
-- not how many were used ''in this job''. The new hwloc-using
code computes the same value -- it was not updated to calculate how
many NUMA nodes are used ''by this job.''
* Note that I cannot compile the smcuda and wv BTLs -- I ''think''
they're right, but they need to be verified by their owners.
* The openib component now does a bunch of stuff to figure out where
"near" OpenFabrics devices are. '''THIS IS A CHANGE IN DEFAULT
BEHAVIOR!!''' and still needs to be verified by OpenFabrics vendors
(I do not have a NUMA machine with an OpenFabrics device that is a
non-uniform distance from multiple different NUMA nodes).
* Completely rewrite the OMPI_Affinity_str() routine from the
"affinity" mpiext extension. This extension now understands
hyperthreads; the output format of it has changed a bit to reflect
this new information.
* Bunches of minor changes around the code base to update names/types
from maffinity/paffinity-based names to hwloc-based names.
* Add some helper functions into the hwloc base, mainly having to do
with the fact that we have the hwloc data reporting ''all''
topology information, but sometimes you really only want the
(online | available) data.
This commit was SVN r26391.
2012-05-07 18:52:54 +04:00
|
|
|
case OPAL_ERR_NOT_BOUND:
|
|
|
|
retval = "Not bound";
|
|
|
|
break;
|
Per the meeting on moving the BTLs to OPAL, move the ORTE database "db" framework to OPAL so the relocated BTLs can access it. Because the data is indexed by process, this requires that we define a new "opal_identifier_t" that corresponds to the orte_process_name_t struct. In order to support multiple run-times, this is defined in opal/mca/db/db_types.h as a uint64_t without identifying the meaning of any part of that data.
A few changes were required to support this move:
1. the PMI component used to identify rte-related data (e.g., host name, bind level) and package them as a unit to reduce the number of PMI keys. This code was moved up to the ORTE layer as the OPAL layer has no understanding of these concepts. In addition, the component locally stored data based on process jobid/vpid - this could no longer be supported (see below for the solution).
2. the hash component was updated to use the new opal_identifier_t instead of orte_process_name_t as its index for storing data in the hash tables. Previously, we did a hash on the vpid and stored the data in a 32-bit hash table. In the revised system, we don't see a separate "vpid" field - we only have a 64-bit opaque value. The orte_process_name_t hash turned out to do nothing useful, so we now store the data in a 64-bit hash table. Preliminary tests didn't show any identifiable change in behavior or performance, but we'll have to see if a move back to the 32-bit table is required at some later time.
3. the db framework was a "select one" system. However, since the PMI component could no longer use its internal storage system, the framework has now been changed to a "select many" mode of operation. This allows the hash component to handle all internal storage, while the PMI component only handles pushing/pulling things from the PMI system. This was something we had planned for some time - when fetching data, we first check internal storage to see if we already have it, and then automatically go to the global system to look for it if we don't. Accordingly, the framework was provided with a custom query function used during "select" that lets you separately specify the "store" and "fetch" ordering.
4. the ORTE grpcomm and ess/pmi components, and the nidmap code, were updated to work with the new db framework and to specify internal/global storage options.
No changes were made to the MPI layer, except for modifying the ORTE component of the OMPI/rte framework to support the new db framework.
This commit was SVN r28112.
2013-02-26 21:50:04 +04:00
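The fetch ordering described in item 3 of the commit message above (check internal storage first, then fall back to the global system) might look roughly like the following. This is a minimal sketch under stated assumptions, not the db framework's API: every name here (`entry_t`, `local_fetch`, `global_fetch`, `db_fetch`) is hypothetical, the global store is a stub, and caching the globally fetched value locally is an assumption that the message does not spell out.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 8

/* Hypothetical local entry, indexed by an opaque 64-bit process id. */
typedef struct {
    uint64_t id;
    char key[32];
    int value;
    int used;
} entry_t;

static entry_t local_cache[CACHE_SLOTS];
static int global_lookups = 0; /* counts trips to the (stub) global store */

/* Stub for the global system (e.g. PMI); it only knows one key. */
static int global_fetch(uint64_t id, const char *key, int *value)
{
    ++global_lookups;
    if (0 == strcmp(key, "bind_level")) {
        *value = (int)(id % 4);
        return 0;
    }
    return -1;
}

/* Look for (id, key) in internal storage. */
static int local_fetch(uint64_t id, const char *key, int *value)
{
    for (int i = 0; i < CACHE_SLOTS; ++i) {
        if (local_cache[i].used && local_cache[i].id == id &&
            0 == strcmp(local_cache[i].key, key)) {
            *value = local_cache[i].value;
            return 0;
        }
    }
    return -1;
}

/* Store (id, key, value) in the first free slot; silently drop if full. */
static void local_store(uint64_t id, const char *key, int value)
{
    for (int i = 0; i < CACHE_SLOTS; ++i) {
        if (!local_cache[i].used) {
            local_cache[i].id = id;
            snprintf(local_cache[i].key, sizeof(local_cache[i].key), "%s", key);
            local_cache[i].value = value;
            local_cache[i].used = 1;
            return;
        }
    }
}

/* Fetch: internal storage first, then the global system. */
static int db_fetch(uint64_t id, const char *key, int *value)
{
    if (0 == local_fetch(id, key, value)) {
        return 0;
    }
    if (0 != global_fetch(id, key, value)) {
        return -1;
    }
    local_store(id, key, *value); /* assumption: cache for later fetches */
    return 0;
}
```

With this ordering, a second `db_fetch` for the same id and key is served locally and does not touch the global store again.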
|
|
|
case OPAL_ERR_TAKE_NEXT_OPTION:
|
|
|
|
retval = "Take next option";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_PROC_ENTRY_NOT_FOUND:
|
|
|
|
retval = "Database entry not found";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_DATA_VALUE_NOT_FOUND:
|
|
|
|
retval = "Data for specified key not found";
|
|
|
|
break;
|
2014-01-20 23:58:56 +04:00
|
|
|
case OPAL_ERR_CONNECTION_FAILED:
|
|
|
|
retval = "Connection failed";
|
|
|
|
break;
|
2014-02-04 18:47:04 +04:00
|
|
|
case OPAL_ERR_AUTHENTICATION_FAILED:
|
|
|
|
retval = "Authentication failed";
|
|
|
|
break;
|
Per the PMIx RFC:
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “pmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. This allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint.
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
This commit was SVN r32570.
2014-08-21 22:56:47 +04:00
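The signature rule from the commit message above (one active collective per unique combination of procs, with no limit on how many different combinations run at once) can be illustrated as follows. This is a hedged sketch, not the grpcomm implementation: `coll_t`, `coll_begin`, and `coll_end` are hypothetical names, procs are represented as plain 64-bit ids, the table size is arbitrary, and the sketch assumes callers always list the procs in the same canonical order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROCS  16
#define MAX_ACTIVE 8

/* Hypothetical record of one in-flight collective. */
typedef struct {
    uint64_t procs[MAX_PROCS]; /* the signature: participating proc ids */
    size_t nprocs;
    bool active;
} coll_t;

static coll_t active_colls[MAX_ACTIVE];

/* Two signatures match if they list the same procs in the same order. */
static bool same_signature(const coll_t *c, const uint64_t *procs, size_t n)
{
    return c->nprocs == n &&
           0 == memcmp(c->procs, procs, n * sizeof(uint64_t));
}

/* Start a collective. Returns a slot index, or -1 if a collective with
 * the same signature is already active (or the table is full). */
static int coll_begin(const uint64_t *procs, size_t n)
{
    int free_slot = -1;

    if (0 == n || n > MAX_PROCS) {
        return -1;
    }
    for (int i = 0; i < MAX_ACTIVE; ++i) {
        if (active_colls[i].active) {
            if (same_signature(&active_colls[i], procs, n)) {
                return -1; /* this combination of procs is busy */
            }
        } else if (free_slot < 0) {
            free_slot = i;
        }
    }
    if (free_slot < 0) {
        return -1;
    }
    memcpy(active_colls[free_slot].procs, procs, n * sizeof(uint64_t));
    active_colls[free_slot].nprocs = n;
    active_colls[free_slot].active = true;
    return free_slot;
}

/* Mark the collective in the given slot as complete. */
static void coll_end(int slot)
{
    active_colls[slot].active = false;
}
```

Note that proc 1 can take part in both a {1, 2, 3} and a {1, 2} collective at the same time; only a second {1, 2, 3} is rejected while the first is still active.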
|
|
|
case OPAL_ERR_COMM_FAILURE:
|
|
|
|
retval = "Comm failure";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_SERVER_NOT_AVAIL:
|
|
|
|
retval = "Server not available";
|
|
|
|
break;
|
2016-01-11 19:46:31 +03:00
|
|
|
case OPAL_ERR_IN_PROCESS:
|
|
|
|
retval = "Operation in process";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_DEBUGGER_RELEASE:
|
|
|
|
retval = "Release debugger";
|
|
|
|
break;
|
2016-06-02 02:34:03 +03:00
|
|
|
case OPAL_ERR_HANDLERS_COMPLETE:
|
2016-07-13 23:28:56 +03:00
|
|
|
retval = "Event handlers complete";
|
2016-06-02 02:34:03 +03:00
|
|
|
break;
|
2016-06-18 01:15:13 +03:00
|
|
|
case OPAL_ERR_PARTIAL_SUCCESS:
|
|
|
|
retval = "Partial success";
|
|
|
|
break;
|
2016-07-13 23:28:56 +03:00
|
|
|
case OPAL_ERR_PROC_ABORTED:
|
|
|
|
retval = "Process abnormally terminated";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_PROC_REQUESTED_ABORT:
|
|
|
|
retval = "Process requested abort";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_PROC_ABORTING:
|
|
|
|
retval = "Process is aborting";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_NODE_DOWN:
|
|
|
|
retval = "Node has gone down";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_NODE_OFFLINE:
|
|
|
|
retval = "Node has gone offline";
|
|
|
|
break;
|
2016-09-01 21:41:13 +03:00
|
|
|
case OPAL_ERR_JOB_TERMINATED:
|
|
|
|
retval = "Job terminated";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_PROC_RESTART:
|
|
|
|
retval = "Process restarted";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_PROC_CHECKPOINT:
|
|
|
|
retval = "Process checkpoint";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_PROC_MIGRATE:
|
|
|
|
retval = "Process migrate";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_EVENT_REGISTRATION:
|
|
|
|
retval = "Event registration";
|
|
|
|
break;
|
2017-03-15 07:44:05 +03:00
|
|
|
case OPAL_ERR_HEARTBEAT_ALERT:
|
|
|
|
retval = "Heartbeat not received";
|
|
|
|
break;
|
|
|
|
case OPAL_ERR_FILE_ALERT:
|
|
|
|
retval = "File alert - proc may have stalled";
|
|
|
|
break;
|
2017-06-07 02:10:52 +03:00
|
|
|
case OPAL_ERR_MODEL_DECLARED:
|
|
|
|
retval = "Model declared";
|
|
|
|
break;
|
2018-02-23 20:57:19 +03:00
|
|
|
case OPAL_PMIX_LAUNCH_DIRECTIVE:
|
|
|
|
retval = "Launch directive";
|
|
|
|
break;
|
|
|
|
|
2005-09-07 22:52:28 +04:00
|
|
|
default:
|
2016-01-11 19:46:31 +03:00
|
|
|
retval = "UNRECOGNIZED";
|
2012-05-07 18:52:54 +04:00
|
|
|
}
|
2005-08-22 07:05:39 +04:00
|
|
|
|
2011-02-13 19:09:17 +03:00
|
|
|
*errmsg = retval;
|
|
|
|
return OPAL_SUCCESS;
|
2005-08-22 07:05:39 +04:00
|
|
|
}
|
2005-08-18 09:34:22 +04:00
|
|
|
|
2005-05-19 17:33:55 +04:00
|
|
|
|
PSM/PSM2: Disable signal handler hijacking by default
Per discussion on https://github.com/open-mpi/ompi/pull/1767 (and some
subsequent phone calls and off-issue email discussions), the PSM
library is hijacking signal handlers by default. Specifically: unless
the environment variable `IPATH_NO_BACKTRACE=1` (for PSM / Intel
TrueScale) is set, the library constructor for this library will
hijack various signal handlers for the purpose of invoking its own
error reporting mechanisms.
This may be a bit *surprising*, but is not a *problem*, per se. The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).
This PSM signal hijacking subverts Open MPI's own signal reporting
mechanism, which may be a bit surprising for some users (particularly
those who do not have Intel TrueScale). As such, we disable it by
default so that Open MPI's own error-reporting mechanisms are used.
Additionally, there is a typo in the library destructor for the PSM2
library that may cause problems in the unloading of its signal
handlers. This problem can be avoided by setting `HFI_NO_BACKTRACE=1`
(for PSM2 / Intel OmniPath).
This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present. Having the PSM/PSM2 libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).
This commit sets the following two environment variables to disable
this behavior from the PSM/PSM2 libraries (if they are not already
set):
* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1
If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-14 17:45:06 +03:00
|
|
|
int opal_init_psm(void)
|
|
|
|
{
|
|
|
|
/* Very early in the init sequence -- before *ANY* MCA components
|
|
|
|
are opened -- we need to disable some behavior from the PSM and
|
|
|
|
PSM2 libraries (by default): at least some old versions of
|
|
|
|
these libraries hijack signal handlers during their library
|
|
|
|
constructors and then do not un-hijack them when the libraries
|
|
|
|
are unloaded.
|
|
|
|
|
|
|
|
It is a bit of an abstraction break that we have to put
|
|
|
|
vendor/transport-specific code in the OPAL core, but we're
|
|
|
|
out of options, unfortunately.
|
|
|
|
|
|
|
|
NOTE: We only disable this behavior if the corresponding
|
|
|
|
environment variables are not already set (i.e., if the
|
|
|
|
user/environment has indicated a preference for this behavior,
|
|
|
|
we won't override it). */
|
|
|
|
if (NULL == getenv("IPATH_NO_BACKTRACE")) {
|
|
|
|
opal_setenv("IPATH_NO_BACKTRACE", "1", true, &environ);
|
|
|
|
}
|
|
|
|
if (NULL == getenv("HFI_NO_BACKTRACE")) {
|
|
|
|
opal_setenv("HFI_NO_BACKTRACE", "1", true, &environ);
|
|
|
|
}
|
|
|
|
|
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
|
|
|
|
|
2018-11-29 01:52:52 +03:00
|
|
|
static int opal_init_error (const char *error, int ret)
|
|
|
|
{
|
|
|
|
if (OPAL_ERR_SILENT != ret) {
|
|
|
|
opal_show_help( "help-opal-runtime.txt",
|
|
|
|
"opal_init:startup:internal-failure", true,
|
|
|
|
error, ret );
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static mca_base_framework_t *opal_init_util_frameworks[] = {
|
|
|
|
&opal_installdirs_base_framework, &opal_if_base_framework, NULL,
|
|
|
|
};
|
2016-06-14 17:45:06 +03:00
|
|
|
|
2006-01-16 04:48:03 +03:00
|
|
|
int
|
2009-12-04 03:51:15 +03:00
|
|
|
opal_init_util(int* pargc, char*** pargv)
|
2005-05-22 22:40:03 +04:00
|
|
|
{
|
2005-10-05 17:56:35 +04:00
|
|
|
int ret;
|
|
|
|
char *error = NULL;
|
2016-04-16 03:34:34 +03:00
|
|
|
char hostname[OPAL_MAXHOSTNAMELEN];
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_INIT(otmng);
|
2005-10-05 17:56:35 +04:00
|
|
|
|
2011-07-12 21:07:41 +04:00
|
|
|
if( ++opal_util_initialized != 1 ) {
|
|
|
|
if( opal_util_initialized < 1 ) {
|
|
|
|
return OPAL_ERROR;
|
|
|
|
}
|
2007-07-19 00:28:19 +04:00
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
|
|
|
|
|
2018-11-29 01:52:52 +03:00
|
|
|
|
|
|
|
OBJ_CONSTRUCT(&opal_init_util_domain, opal_finalize_domain_t);
|
|
|
|
(void) opal_finalize_domain_init (&opal_init_util_domain, "opal_init_util");
|
|
|
|
opal_finalize_set_domain (&opal_init_util_domain);
|
|
|
|
|
2016-10-26 09:38:45 +03:00
|
|
|
opal_thread_set_main();
|
|
|
|
|
2015-04-15 00:34:21 +03:00
|
|
|
opal_init_called = true;
|
|
|
|
|
2014-10-04 01:19:48 +04:00
|
|
|
/* set the nodename right away so anyone who needs it has it. Note
|
|
|
|
* that we don't bother with fqdn and prefix issues here - we let
|
|
|
|
* the RTE later replace this with a modified name if the user
|
|
|
|
* requests it */
|
2016-04-16 03:34:34 +03:00
|
|
|
gethostname(hostname, sizeof(hostname));
|
2014-10-04 01:19:48 +04:00
|
|
|
opal_process_info.nodename = strdup(hostname);
|
|
|
|
|
2005-05-22 22:40:03 +04:00
|
|
|
/* initialize the memory allocator */
|
2005-07-04 05:36:20 +04:00
|
|
|
opal_malloc_init();
|
2005-05-22 22:40:03 +04:00
|
|
|
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_malloc_init");
|
|
|
|
|
2005-05-22 22:40:03 +04:00
|
|
|
/* initialize the output system */
|
2005-07-04 03:31:27 +04:00
|
|
|
opal_output_init();
|
2005-08-22 07:05:39 +04:00
|
|
|
|
2009-09-29 06:07:46 +04:00
|
|
|
/* initialize install dirs code */
|
2013-03-28 01:11:47 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_installdirs_base_framework, 0))) {
|
|
|
|
fprintf(stderr, "opal_installdirs_base_open() failed -- process will likely abort (%s:%d, returned %d instead of OPAL_SUCCESS)\n",
|
2009-09-29 06:07:46 +04:00
|
|
|
__FILE__, __LINE__, ret);
|
|
|
|
return ret;
|
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
This commit represents a bunch of work on a Mercurial side branch. As
such, the commit message back to the master SVN repository is fairly
long.
= ORTE Job-Level Output Messages =
Add two new interfaces that should be used for all new code throughout
the ORTE and OMPI layers (we already made the search-and-replace on
the existing ORTE / OMPI layers):
* orte_output(): (and corresponding friends ORTE_OUTPUT,
orte_output_verbose, etc.) This function sends the output directly
to the HNP for processing as part of a job-specific output
channel. It supports all the same outputs as opal_output()
(syslog, file, stdout, stderr), but for stdout/stderr, the output
is sent to the HNP for processing and output. More on this below.
* orte_show_help(): This function is a drop-in-replacement for
opal_show_help(), with two differences in functionality:
1. the rendered text help message output is sent to the HNP for
display (rather than outputting directly into the process' stderr
stream)
1. the HNP detects duplicate help messages and does not display them
(so that you don't see the same error message N times, once from
each of your N MPI processes); instead, it counts "new" instances
of the help message and displays a message every ~5 seconds when
there are new ones ("I got X new copies of the help message...")
opal_show_help and opal_output still exist, but they only output in
the current process. The intent for the new orte_* functions is that
they can apply job-level intelligence to the output. As such, we
recommend that all new ORTE and OMPI code use the new orte_*
functions, not the opal_* functions.
=== New code ===
For ORTE and OMPI programmers, here's what you need to do differently
in new code:
* Do not include opal/util/show_help.h or opal/util/output.h.
Instead, include orte/util/output.h (this one header file has
declarations for both the orte_output() series of functions and
orte_show_help()).
* Effectively s/opal_output/orte_output/gi throughout your code.
Note that orte_output_open() takes a slightly different argument
list (as a way to pass data to the filtering stream -- see below),
so if you explicitly call opal_output_open(), you'll need to
slightly adapt to the new signature of orte_output_open().
* Literally s/opal_show_help/orte_show_help/. The function signature
is identical.
=== Notes ===
* orte_output'ing to stream 0 will do similar to what
opal_output'ing did, so leaving a hard-coded "0" as the first
argument is safe.
* For systems that do not use ORTE's RML or the HNP, the effect of
orte_output_* and orte_show_help will be identical to their opal
counterparts (the additional information passed to
orte_output_open() will be lost!). Indeed, the orte_* functions
simply become trivial wrappers to their opal_* counterparts. Note
that we have not tested this; the code is simple but it is quite
possible that we mucked something up.
= Filter Framework =
Messages sent via the new orte_* functions described above and
messages output via the IOF on the HNP will now optionally be passed
through a new "filter" framework before being output to
stdout/stderr. The "filter" OPAL MCA framework is intended to allow
preprocessing to messages before they are sent to their final
destinations. The first component that was written in the filter
framework was to create an XML stream, segregating all the messages
into different XML tags, etc. This will allow 3rd party tools to read
the stdout/stderr from the HNP and be able to know exactly what each
text message is (e.g., a help message, another OMPI infrastructure
message, stdout from the user process, stderr from the user process,
etc.).
Filtering is not active by default. Filter components must be
specifically requested, such as:
{{{
$ mpirun --mca filter xml ...
}}}
There can only be one filter component active.
= New MCA Parameters =
The new functionality described above introduces two new MCA
parameters:
* '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that
help messages will be aggregated, as described above. If set to 0,
all help messages will be displayed, even if they are duplicates
(i.e., the original behavior).
* '''orte_base_show_output_recursions''': An MCA parameter to help
debug one of the known issues, described below. It is likely that
this MCA parameter will disappear before v1.3 final.
= Known Issues =
* The XML filter component is not complete. The current output from
this component is preliminary and not real XML. A bit more work
needs to be done to make configure.m4 search for an appropriate XML
library/link it in/use it at run time.
* There are possible recursion loops in the orte_output() and
orte_show_help() functions -- e.g., if RML send calls orte_output()
or orte_show_help(). We have some ideas how to fix these, but
figured that it was ok to commit before feature freeze with known
issues. The code currently contains sub-optimal workarounds so
that this will not be a problem, but it would be good to actually
solve the problem rather than have hackish workarounds before v1.3 final.
This commit was SVN r18434.
2008-05-14 00:00:55 +04:00
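The duplicate-suppression behavior described for orte_show_help() in the commit message above can be sketched like this. It is an illustration of the idea only, not the ORTE implementation: `help_entry_t`, `help_should_display`, and `help_flush_duplicates` are hypothetical names, messages are identified by their help file and topic (an assumption), and the periodic timer that would call the flush function roughly every 5 seconds is left out.

```c
#include <stdio.h>
#include <string.h>

#define MAX_TOPICS 32

/* Hypothetical record of a help message that was already displayed. */
typedef struct {
    char file[64];
    char topic[64];
    int pending; /* duplicates seen since the last report */
} help_entry_t;

static help_entry_t seen[MAX_TOPICS];
static int num_seen = 0;

/* Returns 1 if the message should be displayed now (first occurrence),
 * or 0 if it is a duplicate that was only counted. */
static int help_should_display(const char *file, const char *topic)
{
    for (int i = 0; i < num_seen; ++i) {
        if (0 == strcmp(seen[i].file, file) &&
            0 == strcmp(seen[i].topic, topic)) {
            ++seen[i].pending;
            return 0;
        }
    }
    if (num_seen < MAX_TOPICS) {
        snprintf(seen[num_seen].file, sizeof(seen[num_seen].file), "%s", file);
        snprintf(seen[num_seen].topic, sizeof(seen[num_seen].topic), "%s", topic);
        seen[num_seen].pending = 0;
        ++num_seen;
    }
    return 1;
}

/* Meant to be called periodically: report how many new copies of each
 * message arrived, then reset the counters. Returns the total reported. */
static int help_flush_duplicates(FILE *out)
{
    int total = 0;

    for (int i = 0; i < num_seen; ++i) {
        if (seen[i].pending > 0) {
            fprintf(out, "%d more copies of help message %s / %s\n",
                    seen[i].pending, seen[i].file, seen[i].topic);
            total += seen[i].pending;
            seen[i].pending = 0;
        }
    }
    return total;
}
```

Setting the orte_base_help_aggregate parameter to 0, as described in the message, would correspond to skipping this table and displaying every message.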
|
|
|
/* initialize the help system */
|
|
|
|
opal_show_help_init();
|
|
|
|
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_show_help_init");
|
|
|
|
|
2005-08-22 07:05:39 +04:00
|
|
|
/* register handler for errnum -> string conversion */
|
2015-06-24 06:59:57 +03:00
|
|
|
if (OPAL_SUCCESS !=
|
2007-08-04 04:44:23 +04:00
|
|
|
(ret = opal_error_register("OPAL",
|
|
|
|
OPAL_ERR_BASE, OPAL_ERR_MAX, opal_err2str))) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_error_register", ret);
|
2005-10-05 17:56:35 +04:00
|
|
|
}
|
2005-08-25 00:19:36 +04:00
|
|
|
|
2006-01-16 04:48:03 +03:00
|
|
|
/* keyval lex-based parser */
|
|
|
|
if (OPAL_SUCCESS != (ret = opal_util_keyval_parse_init())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_util_keyval_parse_init", ret);
|
2006-01-16 04:48:03 +03:00
|
|
|
}
|
|
|
|
|
PSM/PSM2: Disable signal handler hijacking by default
Per discussion on https://github.com/open-mpi/ompi/pull/1767 (and some
subsequent phone calls and off-issue email discussions), the PSM
library is hijacking signal handlers by default. Specifically: unless
the environment variable `IPATH_NO_BACKTRACE=1` (for PSM / Intel
TrueScale) is set, the library constructor for this library will
hijack various signal handlers for the purpose of invoking its own
error reporting mechanisms.
This may be a bit *surprising*, but is not a *problem*, per se. The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).
This PSM signal hijacking subverts Open MPI's own signal reporting
mechanism, which may be a bit surprising for some users (particularly
those who do not have Intel TrueScale). As such, we disable it by
default so that Open MPI's own error-reporting mechanisms are used.
Additionally, there is a typo in the library destructor for the PSM2
library that may cause problems in the unloading of its signal
handlers. This problem can be avoided by setting `HFI_NO_BACKTRACE=1`
(for PSM2 / Intel OmniPath).
This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present. Having the PSM/PSM2 libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).
This commit sets the following two environment variables to disable
this behavior from the PSM/PSM2 libraries (if they are not already
set):
* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1
If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).
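The "set only if not already set" behavior described above maps naturally onto POSIX `setenv()` with its overwrite argument set to 0. The following is a minimal sketch assuming a POSIX environment; `disable_psm_backtrace` is a hypothetical helper name, and the real `opal_init_psm()` may be implemented differently.

```c
#include <stdlib.h>

/* Hypothetical helper; the real opal_init_psm() may differ. */
static int disable_psm_backtrace(void)
{
    /* Third argument 0 means: keep any value the user already exported. */
    if (0 != setenv("IPATH_NO_BACKTRACE", "1", 0)) {
        return -1;
    }
    if (0 != setenv("HFI_NO_BACKTRACE", "1", 0)) {
        return -1;
    }
    return 0;
}
```

Because the variables are read by the PSM/PSM2 library constructors, a helper like this would have to run before those libraries are loaded.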
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2016-06-14 17:45:06 +03:00
|
|
|
// Disable PSM signal hijacking (see comment in function for more
|
|
|
|
// details)
|
|
|
|
opal_init_psm();
|
|
|
|
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_init_psm");
|
|
|
|
|
2006-01-16 04:48:03 +03:00
|
|
|
/* Setup the parameter system */
|
2013-05-20 19:36:13 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_var_init())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("mca_base_var_init", ret);
|
2005-10-05 17:56:35 +04:00
|
|
|
}
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_var_init");
|
2005-05-19 17:33:55 +04:00
|
|
|
|
2016-05-24 07:09:44 +03:00
|
|
|
/* read any param files that were provided */
|
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_var_cache_files(false))) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("failed to cache files", ret);
|
2016-05-24 07:09:44 +03:00
|
|
|
}
|
|
|
|
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_var_cache");
|
|
|
|
|
2016-05-24 07:09:44 +03:00
|
|
|
|
2006-01-11 07:36:39 +03:00
|
|
|
/* register params for opal */
|
2007-11-07 04:52:23 +03:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_register_params())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_register_params", ret);
|
2006-01-11 07:36:39 +03:00
|
|
|
}
|
|
|
|
|
2015-04-10 00:25:58 +03:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_net_init())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_net_init", ret);
|
2015-04-10 00:25:58 +03:00
|
|
|
}
|
|
|
|
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_net_init");
|
|
|
|
|
2006-01-16 04:48:03 +03:00
|
|
|
/* pretty-print stack handlers */
|
2006-12-03 16:59:23 +03:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_util_register_stackhandlers())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_util_register_stackhandlers", ret);
|
2006-01-16 04:48:03 +03:00
|
|
|
}
|
|
|
|
|
2013-04-04 20:00:17 +04:00
|
|
|
/* set system resource limits - internally protected against
|
|
|
|
* doing so twice in cases where the launch agent did it for us
|
|
|
|
*/
|
|
|
|
if (OPAL_SUCCESS != (ret = opal_util_init_sys_limits(&error))) {
|
|
|
|
opal_show_help("help-opal-runtime.txt",
|
|
|
|
"opal_init:syslimit", false,
|
|
|
|
error);
|
|
|
|
return OPAL_ERR_SILENT;
|
2007-04-23 22:53:47 +04:00
|
|
|
}
|
2007-08-04 04:44:23 +04:00
|
|
|
|
2013-01-15 05:27:36 +04:00
|
|
|
/* initialize the arch string */
|
|
|
|
if (OPAL_SUCCESS != (ret = opal_arch_init ())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_arch_init", ret);
|
2013-01-15 05:27:36 +04:00
|
|
|
}
|
|
|
|
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_arch_init");
|
|
|
|
|
2009-08-03 20:46:33 +04:00
|
|
|
/* initialize the datatype engine */
|
|
|
|
if (OPAL_SUCCESS != (ret = opal_datatype_init ())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_datatype_init", ret);
|
2009-08-03 20:46:33 +04:00
|
|
|
}
|
|
|
|
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_datatype_init");
|
|
|
|
|
2009-08-03 20:46:33 +04:00
|
|
|
/* Initialize the data storage service. */
|
2008-02-28 04:57:57 +03:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_dss_open())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_dss_open", ret);
|
2008-02-28 04:57:57 +03:00
|
|
|
}
|
2009-08-03 20:46:33 +04:00
|
|
|
|
2018-03-20 07:24:17 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_dss_open");
|
|
|
|
|
2015-04-08 21:00:13 +03:00
|
|
|
/* initialize the mca */
|
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_open())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("mca_base_open", ret);
|
2015-04-08 21:00:13 +03:00
|
|
|
}
|
|
|
|
|
2018-03-20 07:24:17 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "mca_base_open");
|
|
|
|
|
2016-10-03 10:32:22 +03:00
|
|
|
/* initialize if framework */
|
|
|
|
if (OPAL_SUCCESS != (ret = mca_base_framework_open(&opal_if_base_framework, 0))) {
|
|
|
|
fprintf(stderr, "opal_if_base_open() failed -- process will likely abort (%s:%d, returned %d instead of OPAL_SUCCESS)\n",
|
|
|
|
__FILE__, __LINE__, ret);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2019-01-09 22:38:12 +03:00
|
|
|
/* register the util frameworks for cleanup at finalize */
|
|
|
|
opal_finalize_register_cleanup_arg (mca_base_framework_close_list, opal_init_util_frameworks);
|
|
|
|
|
2017-04-13 17:45:37 +03:00
|
|
|
OPAL_TIMING_ENV_NEXT(otmng, "opal_if_init");
|
|
|
|
|
2006-01-16 04:48:03 +03:00
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2018-11-29 01:52:52 +03:00
|
|
|
/* the memcpy component should be one of the first to get
|
|
|
|
* loaded in order to make sure we have all the available
|
|
|
|
* versions of memcpy correctly configured.
|
|
|
|
*/
|
|
|
|
static mca_base_framework_t *opal_init_frameworks[] = {
|
|
|
|
&opal_hwloc_base_framework, &opal_memcpy_base_framework, &opal_memchecker_base_framework,
|
|
|
|
&opal_backtrace_base_framework, &opal_timer_base_framework, &opal_event_base_framework,
|
Move from the use of regex to compression
We've been fighting the battle of trying to create a regex generator and
parser that can handle arbitrary hostname schemes - without long-term
success. The worst of it is that there is no way of checking to see if
the computed regex is correct short of parsing it and doing a
character-by-character comparison with the original string. Ugh...there
has to be a better solution.
One option is to investigate using 3rd-party regex libraries as
those are coming from communities whose sole focus is resolving that
problem. However, someone would need to spend the time to investigate
it, and we'd have to find a license-friendly implementation.
Another option is to quit beating our heads against the wall and just
compress the information. It won't be as much of a reduction, but we
also won't keep hitting scenarios where things break. In this case, it
seems that "perfection" is definitely the enemy of "good enough".
This PR implements the compression option while retaining the
possibility of people adding regex-generating components. The
compression code used in ORTE is consolidated into the opal/compress
framework. That framework previously held bzip and gzip components for
use in compressing checkpoint files - since we no longer support C/R, I
have .opal_ignore'd those components.
However, I have left the original framework APIs alone in case someone
ever decides to redo C/R. The APIs of interest here are added to the
framework - specifically, the "compress_block" and "decompress_block"
functions. I then moved the ORTE zlib compression code into a new
component in this framework.
Unfortunately, the framework currently is a single-select one - i.e.,
only one active component at a time. Since I .opal_ignore'd the other
two and made the priority of zlib high, this isn't a problem. However,
if someone wants to re-enable bzip/gzip or add another component, they
might need to transition opal/compress to a multi-select framework.
Included changes:
* Consolidate the compression code into the opal/compress framework
* Move the ORTE zlib compression code into a new opal/compress/zlib
component
* Ignore the bzip and gzip components in opal/compress framework
* Add a "compress_base_limit" MCA param to set the threshold above which
we compress data - defaults to 4096 bytes
* Delete stale brucks and rcd components from orte/grpcomm framework
* Delete the orte/regx framework
* Update the launch system to use opal/compress instead of string regex
* Provide a default module if no zlib is available
* Fix some misc multi-node issues
* Properly generate the nidmap in response to a "connection warmup"
message so the remote daemon knows the children it needs to launch.
* Remove stale references to orte_node_regex
* opal_byte_object_t's are not OPAL objects - properly release allocated
memory.
* Set the topology
* Currently only handling homogeneous case
* Update the compress framework files to conform
* Consolidate open/close into one "frame" file. Ensure we open/close the
framework
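The "compress_base_limit" threshold mentioned in the list above can be pictured as a simple size check made before any compression is attempted. This is a sketch under assumptions: `compress_limit` and `should_compress` are hypothetical names, and only the 4096-byte default and the "above the threshold" wording come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the value of the compress_base_limit MCA param. */
static size_t compress_limit = 4096;

/* True when a buffer of len bytes is strictly above the threshold. */
static bool should_compress(size_t len)
{
    return len > compress_limit;
}
```

Small payloads skip compression entirely, since the fixed overhead of compressing can outweigh any reduction in size.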
Signed-off-by: Ralph Castain <rhc@pmix.org>
2019-01-30 03:02:21 +03:00
|
|
|
&opal_shmem_base_framework, &opal_reachable_base_framework, &opal_compress_base_framework,
|
|
|
|
NULL,
|
2018-11-29 01:52:52 +03:00
|
|
|
};
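The array above is terminated by a NULL sentinel, so a list-opening routine can walk it without a separate length. The sketch below shows that idiom with a hypothetical miniature type; `mini_framework_t`, `mini_open_list`, `open_ok`, and `open_fail` are illustrative names, and the real `mca_base_framework_open_list()` may behave differently (for example in how it handles a failure partway through).

```c
#include <stddef.h>

/* Hypothetical miniature framework type, not mca_base_framework_t. */
typedef struct {
    const char *name;
    int (*open)(void);  /* returns 0 on success */
    int is_open;
} mini_framework_t;

/* Sample open callbacks used to exercise the walker. */
static int open_ok(void)   { return 0; }
static int open_fail(void) { return -1; }

/* Open each entry in order until the NULL sentinel; stop at the first error. */
static int mini_open_list(mini_framework_t *list[])
{
    for (size_t i = 0; NULL != list[i]; ++i) {
        int rc = list[i]->open();
        if (0 != rc) {
            return rc;
        }
        list[i]->is_open = 1;
    }
    return 0;
}
```

Because entries are opened in array order, placing a framework early in the list (as the comment above requests for memcpy) is what makes it load before the others.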
|
|
|
|
|
2006-01-16 04:48:03 +03:00
|
|
|
int
|
2009-12-04 03:51:15 +03:00
|
|
|
opal_init(int* pargc, char*** pargv)
|
2006-01-16 04:48:03 +03:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2011-07-12 21:07:41 +04:00
|
|
|
if( ++opal_initialized != 1 ) {
|
|
|
|
if( opal_initialized < 1 ) {
|
|
|
|
return OPAL_ERROR;
|
|
|
|
}
|
2007-06-01 06:43:46 +04:00
|
|
|
return OPAL_SUCCESS;
|
|
|
|
}
|
|
|
|
|
2006-01-16 04:48:03 +03:00
|
|
|
/* initialize util code */
|
2009-12-04 03:51:15 +03:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_init_util(pargc, pargv))) {
|
2006-01-16 04:48:03 +03:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2018-11-29 01:52:52 +03:00
|
|
|
OBJ_CONSTRUCT(&opal_init_domain, opal_finalize_domain_t);
|
|
|
|
(void) opal_finalize_domain_init (&opal_init_domain, "opal_init");
|
|
|
|
opal_finalize_set_domain (&opal_init_domain);
|
2011-09-11 23:02:24 +04:00
|
|
|
|
2018-11-29 01:52:52 +03:00
|
|
|
opal_finalize_register_cleanup_arg (mca_base_framework_close_list, opal_init_frameworks);
|
|
|
|
opal_finalize_register_cleanup (opal_tsd_keys_destruct);
|
|
|
|
|
|
|
|
ret = mca_base_framework_open_list (opal_init_frameworks, 0);
|
|
|
|
if (OPAL_UNLIKELY(OPAL_SUCCESS != ret)) {
|
|
|
|
return opal_init_error ("opal_init framework open", ret);
|
2006-04-05 09:57:51 +04:00
|
|
|
}
|
|
|
|
|
2005-09-27 00:20:20 +04:00
|
|
|
/* initialize the memory manager / tracker */
|
2008-05-19 15:57:44 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_mem_hooks_init())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_mem_hooks_init", ret);
|
2008-02-12 11:46:27 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/* select the memory checker */
|
2008-05-19 15:57:44 +04:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_memchecker_base_select())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_memchecker_base_select", ret);
|
2007-05-25 01:54:58 +04:00
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2007-05-25 01:54:58 +04:00
|
|
|
/*
|
2008-02-12 19:59:59 +03:00
|
|
|
* Initialize the general progress engine
|
2007-05-25 01:54:58 +04:00
|
|
|
*/
|
|
|
|
if (OPAL_SUCCESS != (ret = opal_progress_init())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_progress_init", ret);
|
2007-05-25 01:54:58 +04:00
|
|
|
}
|
|
|
|
/* we want to tick the event library whenever possible */
|
|
|
|
opal_progress_event_users_increment();
|
|
|
|
|
2011-06-21 19:41:57 +04:00
|
|
|
/* setup the shmem framework */
|
|
|
|
if (OPAL_SUCCESS != (ret = opal_shmem_base_select())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_shmem_base_select", ret);
|
2011-06-21 19:41:57 +04:00
|
|
|
}
|
2017-09-13 21:43:15 +03:00
|
|
|
|
2018-11-29 01:52:52 +03:00
|
|
|
/* Initialize reachable framework */
|
2017-09-13 21:43:15 +03:00
|
|
|
if (OPAL_SUCCESS != (ret = opal_reachable_base_select())) {
|
2018-11-29 01:52:52 +03:00
|
|
|
return opal_init_error ("opal_reachable_base_select", ret);
|
2007-03-17 02:11:45 +03:00
|
|
|
}
|
2015-06-24 06:59:57 +03:00
|
|
|
|
2019-01-30 03:02:21 +03:00
|
|
|
/* Initialize compress framework */
|
|
|
|
if (OPAL_SUCCESS != (ret = opal_compress_base_select())) {
|
|
|
|
return opal_init_error ("opal_compress_base_select", ret);
|
|
|
|
}
|
|
|
|
|
2005-11-27 00:18:47 +03:00
|
|
|
return OPAL_SUCCESS;
|
George did the work and deserves all the credit for it. Ralph did the merge, and deserves whatever blame results from errors in it :-)
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have expressed interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purposes. UTK, with support from Sandia, developed a version of Open MPI where the entire communication infrastructure has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to complete this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
2014-07-26 04:47:28 +04:00
|
|
|
}
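The counter check at the top of opal_init() makes repeated calls cheap: only the first call performs the real setup, and later calls just bump the count. The following miniature shows the same guard in isolation. `mini_init`, `init_count`, and `real_init_runs` are hypothetical names, and the reading of the `< 1` branch as an unbalanced finalize is an assumption rather than something the source states.

```c
/* Hypothetical miniature of the guard; names are illustrative only. */
static int init_count = 0;      /* plays the role of opal_initialized */
static int real_init_runs = 0;  /* how many times the real setup ran */

static int mini_init(void)
{
    if (++init_count != 1) {
        if (init_count < 1) {
            /* Counter is still below 1 after the increment; presumably
             * finalize was called more often than init. */
            return -1;
        }
        return 0;  /* already initialized: nothing more to do */
    }
    ++real_init_runs;  /* the expensive setup happens exactly once */
    return 0;
}
```

A matching finalize would decrement the same counter and tear things down only when it returns to zero.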
|