This is a point-in-time update that includes support for several new PMIx features, mostly focused on debuggers and "instant on":
* initial prototype support for PMIx-based debuggers. For the moment, this is restricted to using the DVM. Supports direct launch of apps under debugger control, and indirect launch using prun as the intermediate launcher. Includes ability for debuggers to control the environment of both the launcher and the spawned app procs. Work continues on completing support for indirect launch
* IO forwarding for tools. Output of apps launched under tool control is directed to the tool and output there - includes support for XML formatting and output to files. Stdin can be forwarded from the tool to apps, but this hasn't been implemented in ORTE yet.
* Fabric integration for "instant on". Enable collection of network "blobs" to be delivered to network libraries on compute nodes prior to local proc spawn. Infrastructure is in place - implementation will come later.
* Harvesting and forwarding of envars. Enable network plugins to harvest envars and include them in the launch msg for setting the environment prior to local proc spawn. Currently, only OmniPath is supported. PMIx MCA params control which envars are included, and also allows envars to be excluded.
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
Initialize the reachable framework during opal_init() and tear
it back down during opal_finalize(). The framework was never
used, so the lack of initialization didn't matter, but this is
a required step in using the framework.
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Still in the "needs to be done" category:
* mapping/ranking/binding options aren't correctly supported
* if the DVM encounters some errors (e.g., not enough resources for the job), the resulting error is globally set and impacts any subsequent job submission
Signed-off-by: Ralph Castain <rhc@open-mpi.org>
there is no such thing as pthread_join(main_thread), so key destructors
are never invoked on the main thread, which causes valgrind report
some memory leaks. Manually store and then invoke the key destructors and
make valgrind happy.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
the class system can be initialized/finalized as many times as we like,
so there is no more need to have opal_class_finalize() invoked in a destructor
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Add PMIx 2.0
Remove PMIx 1.1.4
Cleanup copying of component
Add missing file
Touchup a typo in the Makefile.am
Update the pmix ext114 component
Minor cleanups and resync to master
Update to latest PMIx 2.x
Update to the PMIx event notification branch latest changes
Per discussion on https://github.com/open-mpi/ompi/pull/1767 (and some
subsequent phone calls and off-issue email discussions), the PSM
library is hijacking signal handlers by default. Specifically: unless
the environment variables `IPATH_NO_BACKTRACE=1` (for PSM / Intel
TrueScale) is set, the library constructor for this library will
hijack various signal handlers for the purpose of invoking its own
error reporting mechanisms.
This may be a bit *surprising*, but is not a *problem*, per se. The
real problem is that older versions of at least the PSM library do not
unregister these signal handlers upon being unloaded from memory.
Hence, a segv can actually result in a double segv (i.e., the original
segv and then another segv when the now-non-existent signal handler is
invoked).
This PSM signal hijacking subverts Open MPI's own signal reporting
mechanism, which may be a bit surprising for some users (particularly
those who do not have Intel TrueScale). As such, we disable it by
default so that Open MPI's own error-reporting mechanisms are used.
Additionally, there is a typo in the library destructor for the PSM2
library that may cause problems in the unloading of its signal
handlers. This problem can be avoided by setting `HFI_NO_BACKTRACE=1`
(for PSM2 / Intel OmniPath).
This is further compounded by the fact that the PSM / PSM2 libraries
can be loaded by the OFI MTL and the usNIC BTL (because they are
loaded by libfabric), even when there is no Intel networking hardware
present. Having the PSM/PSM2 libraries behave this way when no Intel
hardware is present is clearly undesirable (and is likely to be fixed
in future releases of the PSM/PSM2 libraries).
This commit sets the following two environment variables to disable
this behavior from the PSM/PSM2 libraries (if they are not already
set):
* IPATH_NO_BACKTRACE=1
* HFI_NO_BACKTRACE=1
If the user has set these variables before invoking Open MPI, we will
not override their values (i.e., their preferences will be honored).
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Take another shot at untangling the spaghetti
orterun: fix for command line parsing
orte-submit calls opal_init_util () before parsing out MCA command line
options (-mca, -am, etc). This prevents mpirun from setting opal MCA
variables for some frameworks as well as the MCA base. This is because
when a framework is opened all of its variables are set to read-only.
Eventually we want to lift this restriction on some MCA variables but
since -mca is affected we must parse out the MCA command line options
before opal_init_util(). This commit fixes the bug by adding a new
option to opal_cmd_line_parse (ignore unknown option) so orte-submit
can pre-parse the command line for MCA options.
Signed-off-by: Nathan Hjelm <hjelmn@me.com>
Minor cleanups to avoid releasing/recreating the cmd line
Define OPAL_MAXHOSTNAMELEN to be either:
(MAXHOSTNAMELEN + 1) or
(limits.h:HOST_NAME_MAX + 1) or
(255 + 1)
For pmix code, define above using PMIX_MAXHOSTNAMELEN.
Fixup opal layer to use the new max.
Signed-off-by: Karol Mroz <mroz.karol@gmail.com>
Because of the removal of the linux memory component it is no longer
necessary to initialize the memory component in opal_init(). This
commit moves the initialization to the creation of the first rcache
component.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit adds a framework to abstract runtime code patching.
Components in the new framework can provide functions for either
patching a named function or a function pointer. The later
functionality is not being used but may provide a way to allow memory
hooks when dlopen functionality is disabled.
This commit adds two different flavors of code patching. The first is
provided by the overwrite component. This component overwrites the
first several instructions of the target function with code to jump to
the provided hook function. The hook is expected to provide the full
functionality of the hooked function.
The linux patcher component is based on the memory hooks in ucx. It
only works on linux and operates by overwriting function pointers in
the symbol table. In this case the hook is free to call the original
function using the function pointer returned by dlsym.
Both components restore the original functions when the patcher
framework closes.
Changes had to be made to support Power/PowerPC with the Linux
dynamic loader patcher. Some of the changes:
- Move code necessary for powerpc/power support to the patcher
base. The code is needed by both the overwrite and linux
components.
- Move patch structure down to base and move the patch list to
mca_patcher_base_module_t. The structure has been modified to
include a function pointer to the function that will unapply the
patch. This allows the mixing of multiple different types of
patches in the patch_list.
- Update linux patching code to keep track of the matching between
got entry and original (unpatched) address. This allows us to
completely clean up the patch on finalize.
All patchers keep track of the changes they made so that they can be
reversed when the patcher framework is closed.
At this time there are bugs in the Linux dynamic loader patcher so
its priority is lower than the overwrite patcher.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
A few uninitialized common symbols are remaining:
common symbols generated by flex :
* opal/util/keyval/keyval_lex.l: opal_util_keyval_yyleng
* opal/util/keyval/keyval_lex.o: opal_util_keyval_yytext
* opal/util/show_help_lex.l: opal_show_help_yyleng
* opal/util/show_help_lex.l: opal_show_help_yytext
common symbol generated by "external" hwloc library:
* opal/mca/hwloc/hwloc191/hwloc/src/components.o: component_map
This commit is related to an RFC from June 2014. Disscussion can be
found at:
http://www.open-mpi.org/community/lists/devel/2014/07/15140.php
The finalize function is set using either the linker option -fini or
__attribute__((destructor)) depending on compiler support. I have
confirmed that this hybrid approach works with all the major
compilers. The attribute is supported by gcc, clang, llvm, xlc, and
icc. The fini function will support pgi. If a compiler/linker
combination does not support either the destructor or fini function a
message will be printed on re-init indicating it is not supported (an
improvement over the current behavior-- SEGV).
I moved the following to the destructor function:
- Class system finalize. This solves a bug when MPI_T_finalize is
called before MPI_Init. The only downside to this change is we will
leave the footprint of the opal class system after
MPI_Finalize. This footprint should be relatively small.
This is an alternative to #517 but the two PRs are not
mutually-exclusive (with some modifications). This commit should also
be safe for 1.8.x as it does not change internal or external ABI (#517
changes internal ABI).
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit is a rework of the component repository. The changes
included in this commit are:
- Remove the component dependency code based off .ompi_info
files. This code is legacy code dating back 10 years that and is no
longer used.
- Move the plugin scanning code to the component repository. New
calls have been added to add new scanning paths, query available
components, and dlopen/load components.
- Pass the framework down to mca_base_component_find/filter. Eventually
the framework structure will be used to further validate components
before they are used.
- Add support to the MCA framework system to disable scanning for
dlopened components on open (support already existed in
register). This is really only relevant to installdirs as it has no
register function and no DSO components.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
This commit fixes several vagrind errors. Included:
- installdirs did not correctly reinitialize all pointers to NULL
at close. This causes valgrind errors on a subsequent call to
opal_init_tool.
- several opal strings were leaked by opal_deregister_params which
was setting them to NULL instead of letting them be freed by the
MCA variable system.
- move opal_net_init to AFTER the variable system is initialized and
opal's MCA variables have been registered. opal_net_init uses a
variable registered by opal_register_params!
- do not leak ompi_mpi_main_thread when it is allocated by
MPI_T_init_thread.
- do not overwrite ompi_mpi_main_thread if it is already set (by
MPI_T_init_thread).
- mca_base_var: read_files was overwritting mca_base_var_file_list
even if it was non-NULL.
- mca_base_var: set all file global variables to initial states on
finalize.
- btl/vader: decrement enumerator reference count to ensure that it
is freed.
Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
Properly setup the opal_process_info structure early in the initialization procedure. Define the local hostname right at the beginning of opal_init so all parts of opal can use it. Overlay that during orte_init as the user may choose to remove fqdn and strip prefixes during that time. Setup the job_session_dir and other such info immediately when it becomes available during orte_init.
WHAT: Merge the PMIx branch into the devel repo, creating a new
OPAL “lmix” framework to abstract PMI support for all RTEs.
Replace the ORTE daemon-level collectives with a new PMIx
server and update the ORTE grpcomm framework to support
server-to-server collectives
WHY: We’ve had problems dealing with variations in PMI implementations,
and need to extend the existing PMI definitions to meet exascale
requirements.
WHEN: Mon, Aug 25
WHERE: https://github.com/rhc54/ompi-svn-mirror.git
Several community members have been working on a refactoring of the current PMI support within OMPI. Although the APIs are common, Slurm and Cray implement a different range of capabilities, and package them differently. For example, Cray provides an integrated PMI-1/2 library, while Slurm separates the two and requires the user to specify the one to be used at runtime. In addition, several bugs in the Slurm implementations have caused problems requiring extra coding.
All this has led to a slew of #if’s in the PMI code and bugs when the corner-case logic for one implementation accidentally traps the other. Extending this support to other implementations would have increased this complexity to an unacceptable level.
Accordingly, we have:
* created a new OPAL “pmix” framework to abstract the PMI support, with separate components for Cray, Slurm PMI-1, and Slurm PMI-2 implementations.
* Replaced the current ORTE grpcomm daemon-based collective operation with an integrated PMIx server, and updated the grpcomm APIs to provide more flexible, multi-algorithm support for collective operations. At this time, only the xcast and allgather operations are supported.
* Replaced the current global collective id with a signature based on the names of the participating procs. The allows an unlimited number of collectives to be executed by any group of processes, subject to the requirement that only one collective can be active at a time for a unique combination of procs. Note that a proc can be involved in any number of simultaneous collectives - it is the specific combination of procs that is subject to the constraint
* removed the prior OMPI/OPAL modex code
* added new macros for executing modex send/recv to simplify use of the new APIs. The send macros allow the caller to specify whether or not the BTL supports async modex operations - if so, then the non-blocking “fence” operation is used, if the active PMIx component supports it. Otherwise, the default is a full blocking modex exchange as we currently perform.
* retained the current flag that directs us to use a blocking fence operation, but only to retrieve data upon demand
This commit was SVN r32570.
WHAT: Open our low-level communication infrastructure by moving all necessary components (btl/rcache/allocator/mpool) down in OPAL
All the components required for inter-process communications are currently deeply integrated in the OMPI layer. Several groups/institutions have express interest in having a more generic communication infrastructure, without all the OMPI layer dependencies. This communication layer should be made available at a different software level, available to all layers in the Open MPI software stack. As an example, our ORTE layer could replace the current OOB and instead use the BTL directly, gaining access to more reactive network interfaces than TCP. Similarly, external software libraries could take advantage of our highly optimized AM (active message) communication layer for their own purpose. UTK with support from Sandia, developped a version of Open MPI where the entire communication infrastucture has been moved down to OPAL (btl/rcache/allocator/mpool). Most of the moved components have been updated to match the new schema, with few exceptions (mainly BTLs where I have no way of compiling/testing them). Thus, the completion of this RFC is tied to being able to completing this move for all BTLs. For this we need help from the rest of the Open MPI community, especially those supporting some of the BTLs. A non-exhaustive list of BTLs that qualify here is: mx, portals4, scif, udapl, ugni, usnic.
This commit was SVN r32317.
Wire the security check into ORTE's OOB handshake, and add a "version" check to ensure that both ends are from the same ORTE version. If not, report the mismatch and refuse the connection
Fixes trac:4171
cmr=v1.7.5:reviewer=jsquyres:subject=Add a security framework for authenticating connections
This commit was SVN r30551.
The following Trac tickets were found above:
Ticket 4171 --> https://svn.open-mpi.org/trac/ompi/ticket/4171
Provide some nice error messages if we fail to set the limits. Since the user had to specifically request we set the limit, treat failure as an error-out situation.
This commit was SVN r28288.