1
1
Граф коммитов

2215 Коммитов

Автор SHA1 Сообщение Дата
Ralph Castain
dd959f5ab6 Silence an idiotic warning
This commit was SVN r23819.
2010-09-30 17:54:13 +00:00
Jeff Squyres
73bcc4a36b Fix mistake that came in via the ompi-agen tree in r23764. The mistake wasn't part of the core autogen upgrade; it was an additional 'bonus' cleanup. Oops. The mistake will always create a set of directories under installdir, even if you do not --with-devel-headers. The set of directories will be empty, but still -- they should not be there at all. This commit fixes that -- the directories are not created at all if you do not --with-devel-headers
This commit was SVN r23801.

The following SVN revision numbers were found above:
  r23764 --> open-mpi/ompi@40a2bfa238
2010-09-24 22:53:28 +00:00
Ralph Castain
3631e4e936 Revert remaining svn kruft from r23764
This commit was SVN r23786.

The following SVN revision numbers were found above:
  r23764 --> open-mpi/ompi@40a2bfa238
2010-09-22 01:11:40 +00:00
Ralph Castain
40a2bfa238 WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues.
This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change.

Please note that configure requirements on components HAVE CHANGED. For example. a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation.

This commit was SVN r23764.
2010-09-17 23:04:06 +00:00
Abhishek Kulkarni
a143622b54 Remove unused code. Notifier events are aggregated on a per-event basis by the HNP notifier.
This commit was SVN r23711.
2010-09-02 16:00:22 +00:00
Ralph Castain
f75437f5a3 Add the ability to receive notifier output when job completes. Set the notification level to INFO for normal job completion, and to ALERT for abnormal termination.
This commit was SVN r23710.
2010-09-02 14:42:41 +00:00
Ralph Castain
b982f908e8 Fixed some newly-induced warnings
This commit was SVN r23694.
2010-08-31 14:51:19 +00:00
Rainer Keller
97511912ec - Fixup several functions, that cannot return
- Add one instance where we do not use a parameter in a function
 - Fix a buglet in commit r23689, where the attribute-for-function ptrs
   was applied.

This commit was SVN r23690.

The following SVN revision numbers were found above:
  r23689 --> open-mpi/ompi@5eb571c458
2010-08-31 12:21:13 +00:00
Rainer Keller
5eb571c458 - As suggested in CMR #2558, attribute-macros should be
be tested on function pointers and assigned accordingly,
   instead of using the pre-processor in the header files.

   A functional change is (re-) specifying __opal_attribute_noreturn__
   on orte_errmgr_base_abort(): All modules in the errmgr framework
   either use this function, or define their own abort function,
   which sets __opal_attribute_noreturn__.
   This attributes was taken out with the errmgr overhaul in r22872.

This commit was SVN r23689.

The following SVN revision numbers were found above:
  r22872 --> open-mpi/ompi@e4f2d03d28
2010-08-31 10:28:51 +00:00
Ralph Castain
b81358815c Add some debug
This commit was SVN r23686.
2010-08-29 13:45:10 +00:00
Ralph Castain
554aede041 Fix a situation where we were unlocking a thread that isn't locked for the main launch - it is only used for dynamic spawns.
This commit was SVN r23682.
2010-08-28 14:03:17 +00:00
Rainer Keller
4abcf5a0d7 - The Sun-compiler 12 update 1 complains about noreturn-attributes
assigned to function-declarations.
   Check this case and mark the currently only case existing in trunk.

   Thanks to Paul Hargrove for bringing this up.

   Let's test the svn commit msg CMR:v1.5

This commit was SVN r23676.
2010-08-27 09:18:30 +00:00
Rainer Keller
12ed573e5e - Include <strings.h> for rindex(3).
Thanks to Paul Hargrove.

   Please CMR:v1.5

This commit was SVN r23671.
2010-08-26 13:42:36 +00:00
Shiqing Fan
9911797867 Rename a few odls help files for Windows installation.
This commit was SVN r23668.
2010-08-26 09:31:18 +00:00
Ralph Castain
2e223abe33 Restore the auto-poll method for detecting debugger attachment, but only in the mpirx debugger module and only if the corresponding rate mca param is set.
Guess we missed it before, but add the debugger framework to the orte-info and ompi_info tools

This commit was SVN r23667.
2010-08-25 22:52:33 +00:00
Ralph Castain
4ecd9a0bbe Protect against an obscure race condition that AFAICT only occurs when we are in a loop waiting to recv a message from a peer who is then killed by signal.
This commit was SVN r23662.
2010-08-25 15:35:01 +00:00
Ralph Castain
f1a00c9a21 Per Jeff's inquiry, play chicken and don't assume herror exists everywhere.
This commit was SVN r23656.
2010-08-24 20:46:41 +00:00
Jeff Squyres
207ca2d928 This commit is the first of several steps in a paffinity makeover
extravaganza.

= Short version =

This commit does several things, but the short version is that it
re-orients the error message creation of the ODLS default module to
generate error strings in the child process for errors that occur
after the fork but before the exec (such errors are ''usually''
related to paffinity).  A show_help string is rendered in the child
and then IPC'ed up to the parent, who displays the string through
normal ORTE show_help aggregation mechanisms.  We also broke up the
ginormous paffinity-setting logic into a few separate functions, both
to help us understand the code, and hopefully to ease future
maintenance.

The logic for the ODLS default binding should not have changed -- this
is mainly a code reshuffle and improvement on error reporting.

= Rationale =

The reasoning for this commit is complex.  As mentioned above, it's
the first step in some paffinity cleanup.  Here's the line of dominoes
that must fall (in this order):

 1. Add hwloc paffinity component (already done).
 1. While testing hwloc, we discovered that the error reporting from
    the ODLS default module was abysmal.  So we fixed it.
 1. Further, we reorganized the code in the odsl_default_module.c a bit
    to help our understanding of it.
 1. We also discovered a few bugs in the original ODLS default module
    logic that existed before this code shuffle; separate tickets
    will be filed to fix them.
 1. Next up will be some improvements to paffinity / odls default to
    make the act of binding to a core ensure to bind to ''all''
    hardware threads contained in that core (similar for sockets:
    binding to a socket will bind to ''all'' hardware threads in that
    socket).
 1. Next will be improvements to paffinity to expose binding to
    hardware threads through the paffinity framework API.
 1. Finally, we'll expose these binding controls to the user (e.g.,
    through mpirun command line arguments, MCA parameters, etc.).

This commit represents the first few bullets; the last 4 bullets are
being worked on right now, but there is no definite timeline for
completion. 

= Miscelaneous =

A few points worth mentioning:

 * We have tested this new code a bunch; we're pretty sure it behaves
   just like the trunk -- but with better / more precise error
   reporting.  More testing is needed on a wider array of platforms,
   however. 
 * A big comment at the top of odls_default_module.c explains the
   (new) general scheme for the error reporting.
 * The error reporting in the parent process is now really dumb;
   almost all the intelligence about creating error messages is in the
   child.
 * The show_help file was renamed to be more consistent with other
   help files (help-odls-default.txt -> help-orte-odls-default.txt)
 * Removed the use of sched_yield() because of recent changes in the
   Linux 2.6.3x kernels.  We already had an #else clause for
   select()'ing for 1us if we didn't have sched_yield() -- that is now
   the only code path.  This is not a performance-critical section of
   the code, so this shouldn't be controversial.
 * Replaced the macro-based error reporting with function-based
   reporting.  It's a bit more bulky, but it helped us understand the
   code and saved us multiple times with compile-time parameter
   checking, etc.
 * Cleaned up the use of several show_help messages to ensure that
   they mapped to real messages in help*.txt files.

This commit was SVN r23652.
2010-08-24 19:38:29 +00:00
Ralph Castain
3b3cd67d07 If we are using static ports and cannot resolve a hostname, then see if the proc is on the local host. If so, then attempt to use a loopback interface to complete the connection. Only implemented for IPv4 because the if.c code has been so hashed I couldn't figure out how to do this cleanly for all cases.
This commit was SVN r23647.
2010-08-24 14:14:59 +00:00
Ralph Castain
7608513158 Cleanup the code and add some comments to make it easier to understand. Add a bozo error check
This commit was SVN r23639.
2010-08-24 04:46:59 +00:00
Ralph Castain
2886da5669 Ensure that the local daemon vpid gets defined so that the locality procedures work when using the ess generic module.
This commit was SVN r23638.
2010-08-24 04:38:21 +00:00
Josh Hursey
4ffc2d6f68 fix a couple of missed prefixes
This commit was SVN r23629.
2010-08-19 13:26:33 +00:00
Josh Hursey
fabd5cc153 Simplification of the ErrMgr framework by removing the 'stack'/composite functionality.
The composite functionality was becoming difficult to maintain, so we removed it for now which simplifies the framework design considerably.

Since the 'crmig' and 'autor' components were -very- similar to the 'hnp' component, this commit also merges them together. By moving the 'crmig' and 'autor' to a separate file under the 'hnp' component we are able to isolate the C/R logic to a large extent, thus being only minimally hooked into the previous 'hnp' component.

So other than some name changes, the functionality is all still in place. I will update the C/R documentation later this morning.

This commit was SVN r23628.
2010-08-19 13:09:20 +00:00
Josh Hursey
77792c937d When we checkpoint with the --stop option, be sure to write out all the metadata before clearing the storage handle.
Here we tried to write out the session directory marker after we sync the directory, which happens early in the case of --stop.

Thanks to Ananda Mudar for noticing the bug.

This commit was SVN r23627.
2010-08-18 20:44:03 +00:00
Brian Barrett
13c827dda8 Make trunk compile on Red Storm again
This commit was SVN r23622.
2010-08-17 21:51:38 +00:00
Ralph Castain
bbf84fd92b Refine the protection from cross-dvm communications
This commit was SVN r23615.
2010-08-16 16:33:39 +00:00
Ralph Castain
930f7adb0f Check the return status and report any error
This commit was SVN r23611.
2010-08-13 15:04:59 +00:00
Ralph Castain
4491a0e5dc Add a channel for reporting errors, fix a bug in the tcp module
This commit was SVN r23610.
2010-08-13 15:04:22 +00:00
Ralph Castain
ace1f60429 Rename an mca param to something more intuitive and set its default to 0 so the module only runs if a non-zero value is provided
This commit was SVN r23609.
2010-08-13 15:03:45 +00:00
Rainer Keller
fc4cb0c0c1 - Allow changing ALPS run command
- Fix misnomer

This commit was SVN r23601.
2010-08-12 14:41:35 +00:00
Ralph Castain
5715a5b421 Let VM-based mappings include the updated nidmap
This commit was SVN r23596.
2010-08-11 21:04:28 +00:00
Josh Hursey
e12ca48cd9 A number of C/R enhancements per RFC below:
http://www.open-mpi.org/community/lists/devel/2010/07/8240.php

Documentation:
  http://osl.iu.edu/research/ft/

Major Changes: 
-------------- 
 * Added C/R-enabled Debugging support. 
   Enabled with the --enable-crdebug flag. See the following website for more information: 
   http://osl.iu.edu/research/ft/crdebug/ 
 * Added Stable Storage (SStore) framework for checkpoint storage 
   * 'central' component does a direct to central storage save 
   * 'stage' component stages checkpoints to central storage while the application continues execution. 
     * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) 
     * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) 
 * Added Compression (compress) framework to support 
 * Add two new ErrMgr recovery policies 
   * {{{crmig}}} C/R Process Migration 
   * {{{autor}}} C/R Automatic Recovery 
 * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component 
 * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) 
   * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) 
   * {{{OMPI_CR_Restart}}} 
   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) 
   * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) 
   * {{{OMPI_CR_Quiesce_start}}} 
   * {{{OMPI_CR_Quiesce_checkpoint}}} 
   * {{{OMPI_CR_Quiesce_end}}} 
   * {{{OMPI_CR_self_register_checkpoint_callback}}} 
   * {{{OMPI_CR_self_register_restart_callback}}} 
   * {{{OMPI_CR_self_register_continue_callback}}} 
 * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
 * Add a progress meter to: 
   * FileM rsh (filem_rsh_process_meter) 
   * SnapC full (snapc_full_progress_meter) 
   * SStore stage (sstore_stage_progress_meter) 
 * Added 2 new command line options to ompi-restart 
   * --showme : Display the full command line that would have been exec'ed. 
   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) 
 * Deprecated some MCA params: 
   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir 
   * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir 
   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared 
   * snapc_base_store_in_place deprecated, replaced with different components of SStore 
   * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref 
   * snapc_base_establish_global_snapshot_dir deprecated, never well supported 
   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem 

Minor Changes: 
-------------- 
 * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. 
 * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components 
 * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. 
 * Fixes trac:2208 : Honor various TMPDIR varaibles instead of forcing {{{/tmp}}} 
 * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. 
 * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. 
 * Cleanup the CRS framework and components to work with the SStore framework. 
 * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). 
 * Add 'quiesce' hook to CRCP for a future enhancement. 
 * We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. 
 * Add optional application level INC callbacks (registered through the CR MPI Ext interface). 
 * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. 
 * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. 
 * {{{opal-restart}}} also support local decompression before restarting 
 * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata 
 * {{{orte-restart}}} now uses the SStore framework to work with the metadata 
 * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. 
 * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. 
 * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. 
 * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. 
 * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options. 
 * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. 
 * Improve the checks for 'already checkpointing' error path. 
 * A a recovery output timer, to show how long it takes to restart a job 
 * Do a better job of cleaning up the old session directory on restart. 
 * Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment) 
 * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. 

This commit was SVN r23587.

The following Trac tickets were found above:
  Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
  Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
  Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
  Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
  Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
  Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
  Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
2010-08-10 20:51:11 +00:00
Terry Dontje
b74ef351b7 Added new solaris sysinfo module. Also added code to assign
orte_local_chip_type and orte_local_chip_model in MPI processes it the
appropriate sysinfo module found the values on the machine.

This commit was SVN r23581.
2010-08-09 19:28:56 +00:00
Ralph Castain
9fbd7c1949 Fix a bug in tcp multicast
This commit was SVN r23547.
2010-08-04 01:37:54 +00:00
Ralph Castain
94f046e7f3 Fix an "oops" where we missed some instances of a name change. Thanks to Greg Koenig for spotting it!
This commit was SVN r23545.
2010-08-03 16:36:08 +00:00
Josh Hursey
01fbb729e3 Need to get the nidmap so that the apps have something to look at for wireup.
Without this we get errors like the following since the nidmap is empty in the apps:
[[13094,1],1] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 234
[[13094,1],1] ORTE_ERROR_LOG: Not found in file ess_env_module.c at line 281

This commit was SVN r23536.
2010-07-31 16:25:13 +00:00
Josh Hursey
ba7e94dd89 Some relatively minor C/R related cleanup
* Fix a configure warning for checking --enable-ft-thread
 * In hnp and orted ErrMgr components check to see if other components have already recovered this process before trying to recover it again.
 * Fix 'npernode' for restarting using the resilient rmaps component
 * export ompi_info_set, so that internal functionality can use it.

This commit was SVN r23535.
2010-07-30 18:59:34 +00:00
Ralph Castain
d8ec83f939 Remove an unneeded tag
This commit was SVN r23533.
2010-07-29 02:13:06 +00:00
Ralph Castain
94d4887452 Cleanup some race conditions at job start
This commit was SVN r23515.
2010-07-27 18:24:11 +00:00
Ralph Castain
97cfbdfc8c Add debug
This commit was SVN r23514.
2010-07-27 16:55:16 +00:00
Ralph Castain
cfbfbb75a2 Don't abort if a race condition causes a nid not to be found
This commit was SVN r23513.
2010-07-27 16:20:58 +00:00
Ralph Castain
06fe2c4c20 Add debug
This commit was SVN r23512.
2010-07-27 16:20:29 +00:00
Ralph Castain
e16ec40283 Check the correct envar. Thanks to Philippe for pointing out the error!
This commit was SVN r23499.
2010-07-27 03:49:03 +00:00
Ralph Castain
0c201486e4 Allow for multiple channels for the same channel "name" - could be both input and output channels
This commit was SVN r23498.
2010-07-27 01:38:39 +00:00
Ralph Castain
6158cab2e1 Add missing header
This commit was SVN r23497.
2010-07-27 01:37:41 +00:00
Ralph Castain
1ba8bbe1a9 Don't require the existence of a multicast interface if --enable-multicast wasn't specified.
This commit was SVN r23494.
2010-07-26 15:09:57 +00:00
Ralph Castain
e858b029aa Re-establish support for IBM's POE environment by resurrecting the poe launcher from the v1.2 series, updated for the current trunk. Modify the hnp errmgr module to support batch-operated jobs. Update the new ess generic module to support a wider range of mapping options.
This commit was SVN r23493.
2010-07-24 23:35:20 +00:00
Ralph Castain
ff2d573f7e Adjust some default values, and ensure we don't start sending too soon
This commit was SVN r23492.
2010-07-23 19:37:16 +00:00
Ralph Castain
140e427a79 Ensure that wildcard recvs go to the end of the matching list so that recvs for specific tags take precedence.
Ensure we don't try to send tcp mcast messsages to procs that haven't reported back yet

This commit was SVN r23491.
2010-07-23 19:31:34 +00:00
Ralph Castain
95a908cb27 Provide run-time control over heartbeat
This commit was SVN r23490.
2010-07-23 17:11:12 +00:00