Commit graph

597 commits

Author SHA1 Message Date
Ralph Castain
12cd07c9a9 Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases.
The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang.

This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is:

* pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place

* modified the errmgr to directly call the new routines when termination is detected

* removed the grpcomm.onesided_barrier and its associated RML tag

* add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree

* use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero
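
The route-counting termination described in the last two bullets can be sketched roughly as follows. This is a minimal illustration, not the ORTE implementation: `daemon_state_t`, `num_routes()`, and `route_lost()` here are hypothetical stand-ins with invented signatures, and only the idea (terminate once no dependent routes remain) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a daemon's view of its routing tree. */
typedef struct {
    size_t num_children;   /* dependent routes that are still alive */
    bool   terminate;      /* set once nothing depends on this daemon */
} daemon_state_t;

/* Hypothetical analogue of the "num_routes" query. */
static size_t num_routes(const daemon_state_t *d)
{
    assert(NULL != d);
    return d->num_children;
}

/* Hypothetical analogue of route_lost for a route to a child:
 * drop the child, then terminate if no dependent routes remain. */
static void route_lost(daemon_state_t *d)
{
    assert(NULL != d);
    if (d->num_children > 0) {
        d->num_children--;
    }
    if (0 == num_routes(d)) {
        d->terminate = true;
    }
}
```

With two children, the first lost route leaves the daemon running and the second one triggers termination, so no barrier message is needed to decide when to quit.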

Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial.

This commit was SVN r23429.
2010-07-17 21:03:27 +00:00
Ralph Castain
23b166cbd4 Initialize pointers to NULL
This commit was SVN r23422.
2010-07-15 16:33:57 +00:00
Ralph Castain
2193ace464 Ensure the debugger symbols are loaded into orterun prior to orte_init so they are found by attaching debuggers.
This commit was SVN r23368.
2010-07-09 17:51:00 +00:00
Ralph Castain
31295e8dc2 As discussed on today's telecon, reorganize the debugger attachment code in orte to better support efforts within the tool community aimed at exploring alternative methods. Move the debugger attachment code from the orterun directory to a new debugger framework. Organize the existing standard support code into an "mpir" component. Organize the current extensions for co-spawning debugger daemons into a separate "mpirx" component.
Since the MPIR symbols are now included in the ORTE library, remove duplicate declarations in OMPI and replace them with extern references to their ORTE instantiations.

This commit was SVN r23360.
2010-07-06 23:35:42 +00:00
Jeff Squyres
8fef296b8a Updates about thread support levels.
This commit was SVN r23341.
2010-07-02 13:14:09 +00:00
Ralph Castain
628936a99f Provide a convenience option to disable fault recovery (as opposed to setting three separate, long-named mca params)
This commit was SVN r23327.
2010-07-01 19:31:11 +00:00
Ralph Castain
099c3aad97 Fix a major foopah that broke debugger attach. With the revisions in updating proc state, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0.
This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all.

Thanks to Dong Ahn (LLNL) for catching this problem!

Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm.

This commit was SVN r23300.
2010-06-24 05:13:53 +00:00
Ralph Castain
ae746a390f Debugger daemons spawned upon attachment to a running job need to be treated just like a regular job - they are not "piggybacking" onto an existing launch, and so the orte daemons need to report them just like a regular job launch in order to release from spawn.
Modify the debugger control flag to include "do not monitor" so orterun will not take debugger daemon termination into account when deciding that all jobs are done.

This commit was SVN r23282.
2010-06-19 15:22:36 +00:00
Jeff Squyres
f1a7b5cc33 Make "processor affinity not supported" error message a little better:
 * Remove OPAL_ERR_PAFFINITY_NOT_SUPPORTED; fit it into the generic
   OPAL_ERR_NOT_SUPPORTED case.
 * When odls_default detects that processor affinity is not supported,
   it prints a specific message about it, and then it suppresses a
   generic HNP help message that would normally follow it (i.e., it's
   easier to have the "processor affinity is not supported" show_help
   message last).
 * Use some symbolic names in odls_default instead of fixed int's,
   just for slight readability improvements in the code.
 * Introduce orte_show_help_suppress(), which gives the ability to
   suppress any future showings of any arbitrary show_help() message.
   This is useful if you display message X and want to suppress
   message Y.  This suppression *only* works in environments where
   orte_show_help() does coalescing.

This commit was SVN r23249.
2010-06-08 20:16:07 +00:00
Jeff Squyres
e734939ddf Minor word wrapping to make the text fit within the show_help lines.
This commit was SVN r23219.
2010-05-28 20:25:54 +00:00
Josh Hursey
f57e73d4e5 add a few more missing SOS includes
This commit was SVN r23168.
2010-05-18 15:00:07 +00:00
Jeff Squyres
76d1bb3c96 Per some off-list discussion with Ralph and George, MPIR_being_debugged
really should be volatile.
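
A likely reason, sketched under assumptions: a debugger writes this kind of flag from outside the program's normal control flow, so without `volatile` the compiler may keep the value in a register and never observe the change. The names `being_debugged` and `wait_for_debugger()` below are hypothetical and only mimic the pattern.

```c
#include <assert.h>

/* Flag in the style of MPIR_being_debugged: it may be written by an
 * external agent, so it is declared volatile to force a fresh read
 * from memory on every access. */
static volatile int being_debugged = 0;

/* Hypothetical helper: poll the flag up to max_polls times.
 * Returns 1 if the flag was seen set, 0 otherwise. */
static int wait_for_debugger(int max_polls)
{
    assert(max_polls >= 0);
    for (int i = 0; i < max_polls; i++) {
        if (being_debugged) {
            return 1;
        }
    }
    return 0;
}
```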

This commit was SVN r23166.
2010-05-18 11:42:41 +00:00
Abhishek Kulkarni
260eaa24dd Add a missing header. Bleh.
This commit was SVN r23165.
2010-05-17 23:55:53 +00:00
Abhishek Kulkarni
8335ff3893 Fixing a branch merge issue. Looks like we picked the wrong branch when resolving
a conflict.

This commit was SVN r23164.
2010-05-17 23:49:46 +00:00
Abhishek Kulkarni
afbe3e99c6 * Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with
(OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
 SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
 back the native error code.

* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
  (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
  decode 'ret' to get the native error code.
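
The two rules above can be illustrated with a toy encoding. Everything in this sketch is an assumption made for illustration: the `toy_*` functions, the error values, and the bit layout are invented and do not reflect the real OPAL SOS encoding. Only the pattern (decode before comparing against a specific error, compare success directly) comes from the commit message.

```c
#include <assert.h>

/* Toy error codes; the values are assumptions for this sketch. */
enum { TOY_SUCCESS = 0, TOY_ERROR = -1, TOY_ERR_NOT_FOUND = -13 };

/* Hypothetical encoding: pack extra context (info) above the low 8 bits
 * of the negated native code. TOY_SUCCESS is never encoded. */
static int toy_sos_encode(int native, int info)
{
    assert(native < 0 && native > -256 && info >= 0 && info < 1024);
    return -((info << 8) | -native);
}

/* Recover the native code from either a plain or an encoded value. */
static int toy_sos_get_err_code(int ret)
{
    if (ret >= 0) {
        return ret;
    }
    return -((-ret) & 0xFF);
}

/* A direct (TOY_ERR_NOT_FOUND == ret) would miss encoded values,
 * so decode first. */
static int is_not_found(int ret)
{
    return TOY_ERR_NOT_FOUND == toy_sos_get_err_code(ret);
}

/* Success is preserved by the encoding, so no decoding is needed. */
static int failed(int ret)
{
    return TOY_SUCCESS != ret;
}
```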

This commit was SVN r23162.
2010-05-17 23:08:56 +00:00
Ralph Castain
88f5217a12 Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired).
Add new mca params to test:

orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch
orte_debugger_test_attach: Test debugger colaunch after debugger attachment

To test co-launch at job start, just set the orte_debugger_test_daemon param.

To test co-launch upon attach:
set orte_debugger_test_daemon
set orte_debugger_test_attach=1
set orte_enable_debug_cospawn_while_running=1
set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching

Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon.

This commit was SVN r23144.
2010-05-14 18:44:49 +00:00
Ralph Castain
7ce34223f1 Per off-list discussion, implement the new OMPI exit status policy (soon to be on wiki) and further cleanup error reporting to cover new cases.
Implement process migration when failed nodes are detected. Some testing still required

This commit was SVN r23121.
2010-05-12 18:11:58 +00:00
Ralph Castain
4bd25f587c Begin handling the case of lost connections by having the OOB report it to the errmgr instead of the routed framework. Add an "app" component to the errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so.

Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort.

This commit was SVN r23112.
2010-05-11 00:34:12 +00:00
Ralph Castain
d6a1d7a082 Little more cleanup on paffinity. Provide a specific error code for affinity not supported so we can better report the problem. Move the error reporting to orterun so we only get one error message. Update the darwin paffinity module to return the correct new error codes.
This commit was SVN r23107.
2010-05-07 14:04:55 +00:00
Ralph Castain
2ff1ae13e1 Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set.
This commit was SVN r23102.
2010-05-05 00:48:43 +00:00
Ralph Castain
9dfb5c7c62 Rename the orte state framework to be "db", which more accurately reflects its overall capabilities since it can store any kind of data (not just state, although that will be its primary purpose). Update tools and tests accordingly. Add a daemon module for storing data on the daemons - requires --enable-multicast, so it won't build unless that is set
This commit was SVN r23082.
2010-05-03 04:11:03 +00:00
Ralph Castain
319758e3e0 Restore process recovery for procs local to mpirun (first step towards restoring full capability). Define three new MCA params:
1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis

2. orte_max_local_restarts - default max number of local restarts, can be overridden

3. orte_max_global_restarts - default max number of relocates, can be overridden

Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code.

This commit was SVN r23057.
2010-04-28 04:06:57 +00:00
Ralph Castain
b9893aacc5 Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here:
1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified.

2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set.

This support must be enabled by configuring --enable-sensors.

ompi_info and orte-info have been updated to include the new framework.

Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and the various ERRMGR framework globals are now organized into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized.

Implementation continues.

This commit was SVN r23043.
2010-04-26 22:15:57 +00:00
Ralph Castain
efbb5c9b7c Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base:
* add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway.

* update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states.

* remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails.

* update the plm_base_receive.c code to call the errmgr with state updates from remote daemons

* update the routed modules to reflect that process state is updated in the errmgr

* ensure that the orted's open the errmgr and select their appropriate module

* add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place

* define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired.

* update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios

This commit was SVN r23023.
2010-04-23 04:44:41 +00:00
Shiqing Fan
d1e66bdd01 Use variables instead of hard-coded compiler flags, in order to support various C/C++ compilers on Windows.
This commit was SVN r23016.
2010-04-21 12:45:00 +00:00
Ralph Castain
4308922f59 Ensure that any application-specific selection of ess module doesn't get overridden by what is given to the orted or orterun
Cleanup tool name determination for CM

This commit was SVN r22980.
2010-04-15 18:10:50 +00:00
Ralph Castain
c32f046d7c Tiny cleanup - when the user kills us with a ctrl-c, there really isn't a need to tell him "your procs died and we don't know why". Just shaddup and die.
This commit was SVN r22949.
2010-04-09 18:47:35 +00:00
Ralph Castain
de6679dbd3 Truly respect the -quiet option. Make it an mca param so someone doesn't have to put it solely on the cmd line. Tell show_help to shaddup as well.
This commit was SVN r22926.
2010-04-02 14:19:38 +00:00
Ralph Castain
25a9a195b0 When the user requests "quiet", mpirun isn't supposed to output "helpful error messages" - so shaddup!
This commit was SVN r22925.
2010-04-02 07:13:11 +00:00
Jeff Squyres
c57a8fba5a Improve the help message when mpirun cannot find an executable.
Refs trac:2035

This commit was SVN r22922.

The following Trac tickets were found above:
  Ticket 2035 --> https://svn.open-mpi.org/trac/ompi/ticket/2035
2010-04-01 13:26:29 +00:00
Ralph Castain
24c3b4f849 Add the sysinfo framework to the "info" tools, especially since the odls_base_open function calls it!
This commit was SVN r22901.
2010-03-29 20:47:29 +00:00
Josh Hursey
e4f2d03d28 ErrMgr Framework redesign to better support fault tolerance development activities.
Explained in more detail in the following RFC:
  http://www.open-mpi.org/community/lists/devel/2010/03/7589.php

This commit was SVN r22872.
2010-03-23 21:28:02 +00:00
Josh Hursey
9e967a3a9b Revert this change, since even in the CR case we want to reset this var to NULL.
Thanks to Jeff for the catch.

This commit was SVN r22868.
2010-03-23 19:55:21 +00:00
Shiqing Fan
9591680ec0 One of the binaries was generated from a wrong source.
This commit was SVN r22865.
2010-03-23 09:56:11 +00:00
Ralph Castain
e291fc2c69 With Jeff's help, get the libraries to link as required.
Update ompi_info and orte-info to include the new framework.

Fix some selection logic and a typo'd variable name

Still remains ompi_ignored until we complete testing

This commit was SVN r22848.
2010-03-18 02:12:59 +00:00
Ralph Castain
b400b84162 Merge in the modified thread configure option branch per today's telecon.
Remove the --enable-progress-threads option as this is no longer functional, and hardcode OPAL_ENABLE_PROGRESS_THREADS to 0.

Replace the --enable-mpi-threads option with --enable-mpi-thread-multiple as this is clearer as to meaning. This option automatically turns "on" opal thread support if it wasn't already so specified. If the user specifies --disable-opal-multi-threads --enable-mpi-thread-multiple, we will error out with a message

Add a new --enable-opal-multi-threads option that turns "on" opal thread support without doing anything wrt mpi-thread-multiple

This commit was SVN r22841.
2010-03-16 23:10:50 +00:00
Rainer Keller
814fb9399f - Further patches for support on NetBSD (and DragonFly) by
Aleksej Saushev.
   Don't use bash or bashisms in shell scripts
   We should use Posix' setpgid(0,0), which is equivalent to setpgrp().
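
A minimal sketch of the portable call, assuming a POSIX system: a forked child puts itself into its own process group with `setpgid(0, 0)`. The helper name `child_gets_own_group()` is hypothetical and is not taken from the patch.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that places itself in its own process group.
 * Returns 0 if the child became its own group leader, nonzero otherwise. */
static int child_gets_own_group(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (0 == pid) {
        /* First 0: the calling process. Second 0: use the caller's pid
         * as the new process group id. */
        if (0 != setpgid(0, 0)) {
            _exit(1);
        }
        _exit(getpgrp() == getpid() ? 0 : 2);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The check runs in a freshly forked child because `setpgid()` fails for a session leader, which a new child never is.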

This commit was SVN r22829.
2010-03-15 05:33:42 +00:00
Josh Hursey
e9b5162d79 Fix the configure logic for --with-ft so that it properly takes a comma separated list.
Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those.

The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed.

This commit was SVN r22824.
2010-03-12 23:57:50 +00:00
Ralph Castain
f2c65dc70f Ensure that the errmgr does not take action if the process was terminated by a "kill_procs" command as this can lead to circular logic.
Cleanup the kill_procs command by removing a no-longer-used param. We update the process state when the proc actually exits.

This commit was SVN r22783.
2010-03-05 13:22:12 +00:00
Ralph Castain
2541aa98ab Change the app_idx type to uint32_t to support users who use large numbers of app_contexts. Set it up as a new typedef so we can change it later without as much effort.
This commit was SVN r22727.
2010-02-27 17:37:34 +00:00
Ralph Castain
6c0d7940c7 Add a new MCA param (and corresponding mpirun cmd line option) to output the debugger proctable info after launch. The output is just the job map with the process pid included, so you get a node-by-node list of the process ranks on that node and their pids.
Works for initial launch and comm_spawn. xml and non-xml output is available

This commit was SVN r22725.
2010-02-27 08:32:25 +00:00
Ralph Castain
18c7aaff08 Update the grpcomm framework to be more thread-friendly.
Modify the orte configure options to specify --enable-multicast such that it directs components to build or not instead of littering the code base with #if's. Remove those #if's where they used to occur.

Add a new grpcomm "mcast" module to support multicast operations. Still some work required to properly perform daemon collectives for comm_spawn operations. New module only builds when --enable-multicast is provided, and when specifically selected.

This commit was SVN r22709.
2010-02-25 01:11:29 +00:00
Jeff Squyres
af6f1f4b00 Add pkg-config(1) config files to Open MPI. Additionally, fix a minor
bug: libmpi_f90 had libmpi.la in its LIBADD instead of libmpi_f77.la.

Fixes trac:2244.

This commit was SVN r22704.

The following Trac tickets were found above:
  Ticket 2244 --> https://svn.open-mpi.org/trac/ompi/ticket/2244
2010-02-24 18:46:06 +00:00
Jeff Squyres
d9b6b5af0c This commit converts us to the "one big libmpi" scheme that has been
discussed extensively.  See
https://svn.open-mpi.org/trac/ompi/ticket/2092 and the RFC thread
http://www.open-mpi.org/community/lists/devel/2010/02/7447.php.

Specifically:

 * Create LT convenience libraries for OPAL and ORTE if the layer
   above them is being created (use the already-defined
   AM_CONDITIONALs to know if the project above us is being built).
 * ORTE slurps in the LT convenience library for OPAL; OMPI slurps in
   the LT convenience library for ORTE.
 * Wrapper compilers now only -l one library (e.g., ortecc only does
   -lopen-rte, and mpicc only does -lmpi).

This commit was SVN r22691.
2010-02-23 22:20:01 +00:00
Shiqing Fan
08ffdbe987 Changes for portable platform headers. Commit it on behalf of Ralph.
This commit was SVN r22619.
2010-02-15 22:14:59 +00:00
Ralph Castain
58a1151566 Ensure the man page gets into the tarball
This commit was SVN r22613.
2010-02-13 02:39:10 +00:00
Ralph Castain
dc6de3e9b5 Add an "orte-info" tool to report ORTE and OPAL configuration and parameter info ala "ompi_info". Temporary solution until refactoring of the "info" system can be undertaken.
This commit was SVN r22608.
2010-02-11 23:02:14 +00:00
Josh Hursey
ec4498c258 update copyright
This commit was SVN r22605.
2010-02-11 14:03:36 +00:00
Josh Hursey
4513440df9 The ordering of the appfile matters. The ranks must be in increasing rank
order, so that they get assigned the same ORTE name on restart.

cmr:v1.4.2
cmr:v1.5

This commit was SVN r22586.
2010-02-09 00:31:24 +00:00
Eugene Loh
924bc10ceb Add description of the sequential mapper (-mca rmaps seq) to the
mpirun man page.

This commit was SVN r22567.
2010-02-06 06:22:29 +00:00