1
1

17 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
d44e4a311f Per request from Dave Goodell, add support for MPIEXEC_TIMEOUT - if set in the environment, terminate the job after the specified number of seconds has passed. Equivalent to MPICH functionality.
cmr=v1.7.4:reviewer=dgoodell:subject=add support for MPIEXEC_TIMEOUT

This commit was SVN r29831.
2013-12-07 01:58:32 +00:00
Ralph Castain
0f420f3676 Add a little debug
cmr:v1.7.4:reviewer=ompi-gk1.7

This commit was SVN r29705.
2013-11-14 04:22:59 +00:00
Nathan Hjelm
299d5b3dd7 Fix two debugger attach bugs.
- orte_debugger_init_after_spawn was not being called for debuggers that
   use the MPIR_attach_fifo to co-locate debugger daemons.
 - MPIR_Breakpoint was not getting called if a debugger reattached. Add
   a job state (ORTE_JOB_STATE_DEBUGGER_DETACH) to reset mpir_breakpoint_fired
   to false when a debugger detaches to ensure MPIR_Breakpoint is called if
   another debugger attaches. Tested with STAT 2.0/launchmon 1.0.

cmr:v1.7

This commit was SVN r28665.
2013-06-20 16:18:05 +00:00
Nathan Hjelm
c041156f60 Update ORTE frameworks to use the MCA framework system.
This commit was SVN r28240.
2013-03-27 21:14:43 +00:00
Nathan Hjelm
cf377db823 MCA/base: Add new MCA variable system
Features:
 - Support for an override parameter file (openmpi-mca-param-override.conf).
   Variable values in this file can not be overridden by any file or environment
   value.
 - Support for boolean, unsigned, and unsigned long long variables.
 - Support for true/false values.
 - Support for enumerations on integer variables.
 - Support for MPIT scope, verbosity, and binding.
 - Support for command line source.
 - Support for setting variable source via the environment using
   OMPI_MCA_SOURCE_<var name>=source (either command or file:filename)
 - Cleaner API.
 - Support for variable groups (equivalent to MPIT categories).

Notes:
 - Variables must be created with a backing store (char **, int *, or bool *)
   that must live at least as long as the variable.
 - Creating a variable with the MCA_BASE_VAR_FLAG_SETTABLE enables the use of
   mca_base_var_set_value() to change the value.
 - String values are duplicated when the variable is registered. It is up to
   the caller to free the original value if necessary. The new value will be
   freed by the mca_base_var system and must not be freed by the user.
 - Variables with constant scope may not be settable.
 - Variable groups (and all associated variables) are deregistered when the
   component is closed or the component repository item is freed. This
   prevents a segmentation fault from accessing a variable after its component
   is unloaded.
 - After some discussion we decided we should remove the automatic registration
   of component priority variables. Few component actually made use of this
   feature.
 - The enumerator interface was updated to be general enough to handle
   future uses of the interface.
 - The code to generate ompi_info output has been moved into the MCA variable
   system. See mca_base_var_dump().

opal: update core and components to mca_base_var system
orte: update core and components to mca_base_var system
ompi: update core and components to mca_base_var system

This commit also modifies the rmaps framework. The following variables were
moved from ppr and lama: rmaps_base_pernode, rmaps_base_n_pernode,
rmaps_base_n_persocket. Both lama and ppr create synonyms for these variables.

This commit was SVN r28236.
2013-03-27 21:09:41 +00:00
Ralph Castain
cf9796accd Remove the old configure option for disabling full rte support - we now use the OMPI rte framework for such purposes
This commit was SVN r28134.
2013-02-28 01:35:55 +00:00
Nathan Hjelm
bdedd8b0d3 Per RFC modify the behavior of mca_base_components_close to NOT close the output. Modify frameworks to always close their output and set to -1.
Reasoning: The old behavior was a little confusing. mca_base_components_open does not open an output stream so it is a little unexpected that mca_base_components_close does. To add to this several frameworks (that don't use mca_base_components_close) failed to close their output in the framework close function and others closed their output a second time. This change is an improvement to the symantics of mca_base_components_open/close as they are now symetric in their functionality.

This commit was SVN r27570.
2012-11-06 19:09:26 +00:00
Ralph Castain
a080de188f Enable orterun to directly support staged execution, treating each app as a separate job. Support transfer of file maps when support exists.
This commit was SVN r27516.
2012-10-29 23:11:30 +00:00
Ralph Castain
285a3b168d Add an ability to specify the max number of simultaneous procs/node for an application when operating in staged mode. Change some debug statements from OPAL_OUTPUT_VERBOSE to opal_output_verbose so they are available in optimized builds.
This commit was SVN r27445.
2012-10-14 03:31:32 +00:00
Ralph Castain
98580c117b Introduce staged execution. If you don't have adequate resources to run everything without oversubscribing, don't want to oversubscribe, and aren't using MPI, then staged execution lets you (a) run as many procs as there are available resources, and (b) start additional procs as others complete and free up resources. Adds a new mapper as well as a new state machine.
Remove some stale configure.m4's we no longer need.

Optimize the nidmaps a bit by only sending info that has changed each time, instead of sending a complete copy of everything. Makes no difference for the typical MPI job - only impacts things like staged execution where we are sending multiple (possibly many) launch messages.

This commit was SVN r27165.
2012-08-28 21:20:17 +00:00
Ralph Castain
0d878937fe If a callback is set in the state machine, and the state doesn't yet exist, create it
This commit was SVN r26947.
2012-08-03 21:43:36 +00:00
Ralph Castain
431d5361ed For those who really preferred our prior mode of operation that mapped procs and only launched daemons on the nodes that had procs on them, introduce the "novm" state machine component. This recreates the old mode of operation by re-ordering the launch sequence so that we allocate, then map, and then launch daemons only on the reqd nodes (instead of across the entire allocation).
This commit was SVN r26946.
2012-08-03 16:30:05 +00:00
Ralph Castain
b2f77bf08f Extend the iof by adding two new components to support map-reduce IO chaining. Add a mapreduce tool for running such applications.
Fix the state machine to support multiple jobs being simultaneously launched as this is not only required for mapreduce, but can happen under comm-spawn applications as well.

This commit was SVN r26380.
2012-05-02 21:00:22 +00:00
Ralph Castain
9cd4c06488 Get things to build and run when --disable-orte is specified
This commit was SVN r26263.
2012-04-10 21:50:01 +00:00
Ralph Castain
bd8b4f7f1e Sorry for mid-day commit, but I had promised on the call to do this upon my return.
Roll in the ORTE state machine. Remove last traces of opal_sos. Remove UTK epoch code.

Please see the various emails about the state machine change for details. I'll send something out later with more info on the new arch.

This commit was SVN r26242.
2012-04-06 14:23:13 +00:00
Ralph Castain
9dfb5c7c62 Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set
This commit was SVN r23082.
2010-05-03 04:11:03 +00:00
Ralph Castain
ffd5be6aa1 Add a new framework to ORTE for saving and recovering state information. Two components are included that use the db or dbm library for storing the data, with a distributed hash table component coming later.
Note that each of these components will only be selected if specifically requested - otherwise, a "NULL" component will be used.  The framework is only opened by the HNP and orteds, though neither is currently coded to save/restore state

This commit was SVN r22839.
2010-03-16 20:59:48 +00:00