openmpi

Author	SHA1	Message	Date
Josh Hursey	4ffc2d6f68	fix a couple of missed prefixes This commit was SVN r23629.	2010-08-19 13:26:33 +00:00
Josh Hursey	fabd5cc153	Simplification of the ErrMgr framework by removing the 'stack'/composite functionality. The composite functionality was becoming difficult to maintain, so we removed it for now which simplifies the framework design considerably. Since the 'crmig' and 'autor' components were -very- similar to the 'hnp' component, this commit also merges them together. By moving the 'crmig' and 'autor' to a separate file under the 'hnp' component we are able to isolate the C/R logic to a large extent, thus being only minimally hooked into the previous 'hnp' component. So other than some name changes, the functionality is all still in place. I will update the C/R documentation later this morning. This commit was SVN r23628.	2010-08-19 13:09:20 +00:00
Josh Hursey	e12ca48cd9	A number of C/R enhancements per RFC below: http://www.open-mpi.org/community/lists/devel/2010/07/8240.php Documentation: http://osl.iu.edu/research/ft/ Major Changes: -------------- * Added C/R-enabled Debugging support. Enabled with the --enable-crdebug flag. See the following website for more information: http://osl.iu.edu/research/ft/crdebug/ * Added Stable Storage (SStore) framework for checkpoint storage * 'central' component does a direct to central storage save * 'stage' component stages checkpoints to central storage while the application continues execution. * 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress) * 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching) * Added Compression (compress) framework to support * Add two new ErrMgr recovery policies * {{{crmig}}} C/R Process Migration * {{{autor}}} C/R Automatic Recovery * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option) * {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342) * {{{OMPI_CR_Restart}}} * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules) * {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192) * {{{OMPI_CR_Quiesce_start}}} * {{{OMPI_CR_Quiesce_checkpoint}}} * {{{OMPI_CR_Quiesce_end}}} * {{{OMPI_CR_self_register_checkpoint_callback}}} * {{{OMPI_CR_self_register_restart_callback}}} * {{{OMPI_CR_self_register_continue_callback}}} * The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future. 
* Add a progress meter to: * FileM rsh (filem_rsh_process_meter) * SnapC full (snapc_full_progress_meter) * SStore stage (sstore_stage_progress_meter) * Added 2 new command line options to ompi-restart * --showme : Display the full command line that would have been exec'ed. * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413) * Deprecated some MCA params: * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir * snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared * snapc_base_store_in_place deprecated, replaced with different components of SStore * snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref * snapc_base_establish_global_snapshot_dir deprecated, never well supported * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem Minor Changes: -------------- * Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing. * Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components * Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it. * Fixes trac:2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}} * Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set. * opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality. * Cleanup the CRS framework and components to work with the SStore framework. * Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably). * Add 'quiesce' hook to CRCP for a future enhancement. 
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}. * Add optional application level INC callbacks (registered through the CR MPI Ext interface). * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive. * {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked. * {{{opal-restart}}} also supports local decompression before restarting * {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata * {{{orte-restart}}} now uses the SStore framework to work with the metadata * Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality. * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}. * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped. * Make sure to decrement the number of 'num_local_procs' in the orted when one goes away. * odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options. * Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities. * Improve the checks for 'already checkpointing' error path. * Add a recovery output timer, to show how long it takes to restart a job * Do a better job of cleaning up the old session directory on restart. * Add a local module to the autor and crmig ErrMgr components. 
These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment) * Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize. This commit was SVN r23587. The following Trac tickets were found above: Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924 Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097 Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161 Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192 Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208 Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342 Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413	2010-08-10 20:51:11 +00:00
Josh Hursey	ba7e94dd89	Some relatively minor C/R related cleanup * Fix a configure warning for checking --enable-ft-thread * In hnp and orted ErrMgr components check to see if other components have already recovered this process before trying to recover it again. * Fix 'npernode' for restarting using the resilient rmaps component * export ompi_info_set, so that internal functionality can use it. This commit was SVN r23535.	2010-07-30 18:59:34 +00:00
Ralph Castain	e858b029aa	Re-establish support for IBM's POE environment by resurrecting the poe launcher from the v1.2 series, updated for the current trunk. Modify the hnp errmgr module to support batch-operated jobs. Update the new ess generic module to support a wider range of mapping options. This commit was SVN r23493.	2010-07-24 23:35:20 +00:00
Ralph Castain	12cd07c9a9	Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases. The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang. This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is: * pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place * modified the errmgr to directly call the new routines when termination is detected * removed the grpcomm.onesided_barrier and its associated RML tag * add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree * use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial. This commit was SVN r23429.	2010-07-17 21:03:27 +00:00
Ralph Castain	2babebf9c3	Revert some of a prior commit - there is no need for another flag to indicate one-sided termination of the orteds. This commit was SVN r23372.	2010-07-10 03:33:01 +00:00
Ralph Castain	da61b69b15	Ensure we don't incorrectly return a non-zero exit code when normally terminating a slurm job. Slurm, of course, must always be different... This commit was SVN r23371.	2010-07-09 19:14:10 +00:00
Ralph Castain	31295e8dc2	As discussed on today's telecon, reorganize the debugger attachment code in orte to better support efforts within the tool community aimed at exploring alternative methods. Move the debugger attachment code from the orterun directory to a new debugger framework. Organize the existing standard support code into an "mpir" component. Organize the current extensions for co-spawning debugger daemons into a separate "mpirx" component. Since the MPIR symbols are now included in the ORTE library, remove duplicate declarations in OMPI and replace them with extern references to their ORTE instantiations. This commit was SVN r23360.	2010-07-06 23:35:42 +00:00
Jeff Squyres	5bebdb97fa	Need these header files for NetBSD. Thanks for the heads-up from Aleksej Saushev. This commit was SVN r23343.	2010-07-02 17:38:57 +00:00
Ralph Castain	ee4564c13b	Add some useful debug This commit was SVN r23339.	2010-07-02 03:35:47 +00:00
Ralph Castain	f3d90dfb8d	Fully restore fault recovery, both at the individual process and daemon level. NOTE: MPI fault recovery remains unavailable pending merge from Josh. This only covers ORTE-level processes. This commit was SVN r23335.	2010-07-01 19:45:43 +00:00
Ralph Castain	3237b9ec87	Print a nice error message when a daemon fails, and exit with a non-zero status This commit was SVN r23314.	2010-06-28 16:38:54 +00:00
Ralph Castain	099c3aad97	Fix a major foopah that broke debugger attach. With the revisions in updating proc state, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0. This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all. Thanks to Dong Ahn (LLNL) for catching this problem! Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm. This commit was SVN r23300.	2010-06-24 05:13:53 +00:00
Ralph Castain	e52a54183f	Let max restarts be associated with an app_context instead of a job so that individual apps can have different values. Default to a single job-level value This commit was SVN r23248.	2010-06-07 14:21:08 +00:00
Abhishek Kulkarni	f04dcffecd	Wrap the connection failed check with a SOS macro to extract the native error code. This commit was SVN r23202.	2010-05-23 16:42:08 +00:00
Ralph Castain	73ebb748bb	Ignore comm failures when shutting down orteds This commit was SVN r23201.	2010-05-23 02:57:03 +00:00
Ralph Castain	7c43d6c0f5	Don't drop a core file when we abort due to a lost connection This commit was SVN r23199.	2010-05-22 18:09:40 +00:00
Ralph Castain	ef3c88cbd2	If we have ordered jobs to terminate, then we should ignore comm_failed reports from daemons as they may be dropping out This commit was SVN r23185.	2010-05-20 12:37:09 +00:00
Ralph Castain	05e05089b8	Ignore failed comm connections if it is our connection that failed This commit was SVN r23184.	2010-05-20 03:13:09 +00:00
Ralph Castain	c7d7a18318	Little more cleanup from SOS This commit was SVN r23175.	2010-05-19 16:28:58 +00:00
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Abhishek Kulkarni	f7f4dd87ab	Change ORTE_ERR_LOG macro to make it SOS-aware. If the logged error is an SOS-encoded error, use the SOS log function otherwise use the existing errmgr log function. This commit was SVN r23161.	2010-05-17 23:02:13 +00:00
Ralph Castain	7e6985edbf	Cleanup warnings This commit was SVN r23150.	2010-05-16 20:23:26 +00:00
Ralph Castain	b9f0615727	Correct some logic for tracking launch progress This commit was SVN r23122.	2010-05-12 18:39:10 +00:00
Ralph Castain	7ce34223f1	Per off-list discussion, implement the new OMPI exit status policy (soon to be on wiki) and further cleanup error reporting to cover new cases. Implement process migration when failed nodes are detected. Some testing still required This commit was SVN r23121.	2010-05-12 18:11:58 +00:00
Ralph Castain	4bd25f587c	Begin handling the case of lost connections by having the OOB report it to the errmgr instead of the routed framework. Add an "app" component to the errmgr framework so that it can decide how to respond - which for now at least is just to check for lifeline and abort if so. Add a new error constant to indicate that the error is "unrecoverable" so the oob can know it needs to abort. This commit was SVN r23112.	2010-05-11 00:34:12 +00:00
Ralph Castain	86033074ce	Grr...it's a job state, not a proc state This commit was SVN r23110.	2010-05-07 16:50:10 +00:00
Ralph Castain	70bef5ca95	It is possible for the jdata param to be NULL, so protect the verbose output in that case This commit was SVN r23109.	2010-05-07 16:48:44 +00:00
Shiqing Fan	8ea8563462	Include the new component config file for Windows into tarball. This commit was SVN r23108.	2010-05-07 14:30:12 +00:00
Ralph Castain	d6a1d7a082	Little more cleanup on paffinity. Provide a specific error code for affinity not supported so we can better report the problem. Move the error reporting to orterun so we only get one error message. Update the darwin paffinity module to return the correct new error codes. This commit was SVN r23107.	2010-05-07 14:04:55 +00:00
Shiqing Fan	76726f6094	Update a few module configurations for Windows. This commit was SVN r23104.	2010-05-05 12:22:04 +00:00
Ralph Castain	2ff1ae13e1	Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set. This commit was SVN r23102.	2010-05-05 00:48:43 +00:00
Ralph Castain	cd569f8a79	Restore the global restart capability This commit was SVN r23089.	2010-05-04 02:40:29 +00:00
Ralph Castain	5103ffead6	Continue work on local recovery for orteds This commit was SVN r23080.	2010-05-03 04:08:13 +00:00
Ralph Castain	c93af95351	Since there is no defined behavior for the case where all application procs exit normally, but some or all have non-zero returns, just output a warning telling the user how many procs meet that criteria. Let the return code of mpirun in that scenario reflect any errors in OMPI/ORTE itself. Clearly a temporary solution until a defined behavior can be established. This commit was SVN r23075.	2010-04-30 19:01:10 +00:00
Ralph Castain	c62418d76d	Add a new test that checks behavior when we call exit with a non-zero return code after calling finalize - don't ask why. Modify the check_complete code so it finds the first non-zero exit status (i.e., the one from the lowest rank) in a job that terminates normally, and sets the mpirun exit code to that status. This commit was SVN r23071.	2010-04-29 19:58:44 +00:00
Ralph Castain	5689bd6d30	Remove debug This commit was SVN r23062.	2010-04-28 22:14:23 +00:00
Ralph Castain	319758e3e0	Restore process recovery for procs local to mpirun (first step towards restoring full capability). Define three new MCA params: 1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis 2. orte_max_local_restarts - default max number of local restarts, can be overridden 3. orte_max_global_restarts - default max number of relocates, can be overridden Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code. This commit was SVN r23057.	2010-04-28 04:06:57 +00:00
Ralph Castain	55889934d8	After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon. Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly. This commit was SVN r23045.	2010-04-27 03:39:32 +00:00
Ralph Castain	b9893aacc5	Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here: 1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified. 2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set. This support must be enabled by configuring --enable-sensors. ompi_info and orte-info have been updated to include the new framework. Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and organizes the various ERRMGR framework globals into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized. Implementation continues. This commit was SVN r23043.	2010-04-26 22:15:57 +00:00
Ralph Castain	e3164d2ac1	For the autogen-challenged (i.e., Jeff), create a new ORTE constant that tells the user that a required module was not found. Update the errmgr select function to output the error if no module is found. This commit was SVN r23032.	2010-04-24 01:39:26 +00:00
Ralph Castain	efbb5c9b7c	Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base: * add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway. * update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states. * remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails. * update the plm_base_receive.c code to call the errmgr with state updates from remote daemons * update the routed modules to reflect that process state is updated in the errmgr * ensure that the orted's open the errmgr and select their appropriate module * add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place * define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired. * update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios This commit was SVN r23023.	2010-04-23 04:44:41 +00:00
Ralph Castain	a1e82e9d05	Per discussion with Josh, cleanup the errmgr API by creating separate modules for the public vs internal APIs. This mirrors the architecture used in other frameworks that had similar requirements. Remove the orcm errmgr module - moving to the orcm code base so it can utilize orcm communications and not interfere with ompi-related operations. This commit was SVN r22931.	2010-04-05 22:59:21 +00:00
Ralph Castain	1caba7af2f	Fix a bunch of compiler warnings reported by Jeff This commit was SVN r22930.	2010-04-03 00:20:19 +00:00
Ralph Castain	522a23d6a3	A few changes to the FT-related configure options: 1. fix a bug that caused an infinite loop in configure when specifying want-ft but not want-ft-thread by removing a stale reference to the opal-progress-thread option 2. add want-ft=orcm so we can build the orcm errmgr component 3. cleanup the use of "ompi_want_ft_xxx" and replace it with "opal_want_ft_xxx" so that naming conventions are preserved This commit was SVN r22885.	2010-03-25 22:53:48 +00:00
Josh Hursey	e4f2d03d28	ErrMgr Framework redesign to better support fault tolerance development activities. Explained in more detail in the following RFC: http://www.open-mpi.org/community/lists/devel/2010/03/7589.php This commit was SVN r22872.	2010-03-23 21:28:02 +00:00
Ralph Castain	f2c65dc70f	Ensure that the errmgr does not take action if the process was terminated by a "kill_procs" command as this can lead to circular logic. Cleanup the kill_procs command by removing a no-longer-used param. We update the process state when the proc actually exits. This commit was SVN r22783.	2010-03-05 13:22:12 +00:00
Ralph Castain	cd1efbb41e	Try and do a better job of cleanup in abnormal termination. Ensure the daemons whack session directories prior to disabling signal traps. Ensure that the HNP and daemons all cleanup when they are doing an internal abort. This commit was SVN r22755.	2010-03-02 14:51:23 +00:00
Shiqing Fan	872a4047ba	Fix the bug caused by ADD_DEPENDENCIES() behaving differently across CMake versions. In CMake 2.6 and earlier, this function adds dependencies for targets and also links the target libraries automatically, but in CMake 2.8 this behavior has been changed, i.e. it only adds the dependencies without linking, which causes linking errors at build time. This commit was SVN r22405.	2010-01-14 18:10:20 +00:00
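
The exit-status rule described in r23071 (for a job that terminates normally, take the first non-zero exit status, i.e. the one from the lowest rank, as the mpirun exit code) can be illustrated with a small model. This is only a sketch in Python, not ORTE code: the function name `mpirun_exit_code` and the representation of a job as a rank-ordered list of statuses are assumptions made for illustration.

```python
from typing import Sequence


def mpirun_exit_code(exit_statuses: Sequence[int]) -> int:
    """Hypothetical model of the rule in r23071.

    exit_statuses[i] is the exit status of rank i in a job whose
    processes all terminated normally. The result is the first
    non-zero status in rank order, or 0 if every rank returned 0.
    """
    for status in exit_statuses:
        if status != 0:
            return status
    return 0
```

Scanning in rank order makes the result deterministic when several ranks exit with different non-zero codes, which is presumably why the commit ties the choice to the lowest rank.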

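The file-activity sensor introduced in r23043 notifies the errmgr when a file's size, access time and modification time fail to change over a specified number of sampling iterations. The detection logic might look roughly like the sketch below. It is a Python model under stated assumptions, not the ORTE implementation: the class `StallDetector`, its `limit` parameter and the tuple-shaped observation are all hypothetical names.

```python
from typing import Hashable

_UNSET = object()  # sentinel meaning "no sample taken yet"


class StallDetector:
    """Hypothetical model: flag a stall after `limit` unchanged samples in a row."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self._last = _UNSET
        self._unchanged = 0

    def sample(self, observation: Hashable) -> bool:
        """Record one observation, e.g. (size, atime, mtime).

        Returns True when the observation has been identical to the
        previous one for `limit` consecutive samples, which is the
        point at which the errmgr would be notified.
        """
        if self._last is not _UNSET and observation == self._last:
            self._unchanged += 1
        else:
            self._unchanged = 0
        self._last = observation
        return self._unchanged >= self.limit
```

Any change in the observation resets the counter, so a process that writes to its file only occasionally is not reported as long as it does so within `limit` sampling intervals.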

135 Commits