openmpi

Author	SHA1	Message	Date
Ralph Castain	e858b029aa	Re-establish support for IBM's POE environment by resurrecting the poe launcher from the v1.2 series, updated for the current trunk. Modify the hnp errmgr module to support batch-operated jobs. Update the new ess generic module to support a wider range of mapping options. This commit was SVN r23493.	2010-07-24 23:35:20 +00:00
Ralph Castain	72525a5850	Allow the passing of NULL to orte_plm.kill_local_procs to match what we allow for the equivalent orte_odls call. This commit was SVN r23436.	2010-07-20 04:05:41 +00:00
Ralph Castain	12cd07c9a9	Start reducing our dependency on the event library by removing at least one instance where we use it to redirect the program counter. Rolf reported occasional hangs of mpirun in very specific circumstances after all daemons were done. A review of MTT results indicates this may have been happening more generally in a small fraction of cases. The problem was tracked to use of the grpcomm.onesided_barrier to control daemon/mpirun termination. This relied on messaging -and- required that the program counter jump from the errmgr back to grpcomm. On rare occasions, this jump did not occur, causing mpirun to hang. This patch looks more invasive than it is - most of the affected files simply had one or two lines removed. The essence of the change is: * pulled the job_complete and quit routines out of orterun and orted_main and put them in a common place * modified the errmgr to directly call the new routines when termination is detected * removed the grpcomm.onesided_barrier and its associated RML tag * add a new "num_routes" API to the routed framework that reports back the number of dependent routes. When route_lost is called, the daemon's list of "children" is checked and adjusted if that route went to a "leaf" in the routing tree * use connection termination between daemons to track rollup of the daemon tree. Daemons and HNP now terminate once num_routes returns zero Also picked up in this commit is the addition of a new bool flag to the app_context struct, and increasing the job_control field from 8 to 16 bits. Both trivial. This commit was SVN r23429.	2010-07-17 21:03:27 +00:00
Jeff Squyres	c98729694a	Fix a compile error in the xgrid plm. This does nothing to make it work; it just now compiles on os x. This commit was SVN r23387.	2010-07-13 11:56:20 +00:00
Ralph Castain	2babebf9c3	Revert some of a prior commit - there is no need for another flag to indicate one-sided termination of the orteds. This commit was SVN r23372.	2010-07-10 03:33:01 +00:00
Ralph Castain	da61b69b15	Ensure we don't incorrectly return a non-zero exit code when normally terminating a slurm job. Slurm, of course, must always be different... This commit was SVN r23371.	2010-07-09 19:14:10 +00:00
Ralph Castain	81a65f2c67	Define a new node state. This commit was SVN r23332.	2010-07-01 19:38:23 +00:00
Ralph Castain	f5548b8e0f	Remove a potential locking conflict, and let emacs go ahead and reformat the function (sigh). This commit was SVN r23331.	2010-07-01 19:37:53 +00:00
Ralph Castain	d463aec2f6	Don't try to send to dead daemons; keep accounting straight so we don't hang. This commit was SVN r23330.	2010-07-01 19:37:02 +00:00
Ralph Castain	099c3aad97	Fix a major foopah that broke debugger attach. With the revisions in updating proc state, we dropped the recording of each proc's pid. Thus, attaching debuggers would find a proctable whose pids all equal 0. This required modification of the errmgr.update_state API so the pid could be passed in to the function that could update the proper data record(s). All calls to that API have been updated as well, but I obviously couldn't test them all. Thanks to Dong Ahn (LLNL) for catching this problem! Also fixed debugger daemon cospawn, both for initial launch and attach-while-running modes. Tested and verified on rsh and slurm. This commit was SVN r23300.	2010-06-24 05:13:53 +00:00
Ralph Castain	8b2a682fba	Return a silent error when -do-not-launch is given. This commit was SVN r23291.	2010-06-22 01:06:10 +00:00
Ethan Mallove	fc37e408c2	Avoid SEGV in case rsh/ssh is not in `PATH` (refs trac:1490). This commit was SVN r23278. The following Trac tickets were found above: Ticket 1490 --> https://svn.open-mpi.org/trac/ompi/ticket/1490	2010-06-17 14:58:09 +00:00
Ralph Castain	da43547983	Don't define the active_jobid until -after- the job has been set up. Clean up references to pointer_array objects. This commit was SVN r23250.	2010-06-09 02:16:05 +00:00
Ralph Castain	e52a54183f	Let max restarts be associated with an app_context instead of a job so that individual apps can have different values. Default to a single job-level value. This commit was SVN r23248.	2010-06-07 14:21:08 +00:00
George Bosilca	f453265de2	Only call gettimeofday once. This commit was SVN r23235.	2010-06-02 09:44:37 +00:00
Ralph Castain	73ebb748bb	Ignore comm failures when shutting down orteds. This commit was SVN r23201.	2010-05-23 02:57:03 +00:00
Ralph Castain	e8f98661bb	Fix a couple of plm modules that were calling a stale function. This commit was SVN r23200.	2010-05-23 02:55:47 +00:00
Shiqing Fan	857f1669e2	Solve a few compilation problems on Windows. This commit was SVN r23193.	2010-05-21 14:30:15 +00:00
Abhishek Kulkarni	afbe3e99c6	* Wrap all the direct error-code checks of the form (OMPI_ERR_* == ret) with (OMPI_ERR_* == OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a SOS-encoded error. OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns the native error code. * Since OPAL_SUCCESS is preserved by SOS, also change all checks of the form (OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to decode 'ret' to get the native error code. This commit was SVN r23162.	2010-05-17 23:08:56 +00:00
Ralph Castain	de85049477	Clean up warnings. This commit was SVN r23149.	2010-05-16 20:22:18 +00:00
Ralph Castain	88f5217a12	Cleanup the debugger daemon co-launch code and add an ability to test it. Implement ability to co-launch debugger daemons upon attach to a running job for jobs launched under rsh, slurm, and tm environments (others can easily be added if desired). Add new mca params to test: orte_debugger_test_daemon: Name of the executable to be used to simulate a debugger colaunch orte_debugger_test_attach: Test debugger colaunch after debugger attachment To test co-launch at job start, just set the orte_debugger_test_daemon param. To test co-launch upon attach: set orte_debugger_test_daemon set orte_debugger_test_attach=1 set orte_enable_debug_cospawn_while_running=1 set orte_debugger_check_rate=<N> - defines the number of seconds to wait before "checking" for a debugger attaching Added a "debugger" program to orte/test/mpi that just spins to simulate a debugger daemon. This commit was SVN r23144.	2010-05-14 18:44:49 +00:00
Ralph Castain	b9f0615727	Correct some logic for tracking launch progress. This commit was SVN r23122.	2010-05-12 18:39:10 +00:00
Jeff Squyres	477201e161	Fix "make dist" breakage. This commit was SVN r23105.	2010-05-06 18:47:20 +00:00
Shiqing Fan	76726f6094	Update a few module configurations for Windows. This commit was SVN r23104.	2010-05-05 12:22:04 +00:00
Ralph Castain	2ff1ae13e1	Create a new "heartbeat" module in the sensor framework and move the plm_base heartbeat code there. Add new proc and job states for heartbeat_failed. Remove the "heartbeat" cmd line option for orted as this is now done automatically if the --enable-heartbeat configure option is set. This commit was SVN r23102.	2010-05-05 00:48:43 +00:00
Ralph Castain	43005b88e0	Catch one more spot...thanks to Sam for reporting it. This commit was SVN r23094.	2010-05-04 17:21:15 +00:00
Ralph Castain	323224b84b	Allow the restart data to be retrieved from a job's app->env or default that given to orterun. This commit was SVN r23077.	2010-05-03 04:05:19 +00:00
Ralph Castain	319758e3e0	Restore process recovery for procs local to mpirun (first step towards restoring full capability). Define three new MCA params: 1. orte_enable_recovery - default recovery policy, can be overridden on a per-job basis 2. orte_max_local_restarts - default max number of local restarts, can be overridden 3. orte_max_global_restarts - default max number of relocates, can be overridden Implement the restart_proc API for the ODLS framework, reorganize the default fns a little to avoid copying code. This commit was SVN r23057.	2010-04-28 04:06:57 +00:00
Ralph Castain	401fb22847	Fix the pack/unpack of process state values by getting the actual packing size to match the new value size. Improve the debug output by printing the string. This commit was SVN r23048.	2010-04-27 15:58:04 +00:00
Ralph Castain	efc7d3e2f6	Update the LSF PLM module to ensure that the path is correctly passed. Many thanks to Teng Lin for the error report - and the patch! This commit was SVN r23047.	2010-04-27 03:46:04 +00:00
Ralph Castain	55889934d8	After hours spent chasing the stupid "abort" file, it became clear that we were always going to be plagued by that idiot contraption when trying to be good citizens and properly cleanup. So get rid of it by instead doing a messaging handshake with the local daemon. Note that this isn't a problem since MPI_Abort and orte_abort are only called under controlled circumstances - i.e., we are doing an orderly abort and not segfaulting. If we can't get the message out for some reason, then too bad - we'll still see an abnormal process termination and act accordingly. This commit was SVN r23045.	2010-04-27 03:39:32 +00:00
Ralph Castain	b9893aacc5	Add a sensor framework to ORTE that monitors applications and notifies the errmgr when they exceed specified boundaries. Two modules are included here: 1. file activity - can monitor file size, access and modification times. If these fail to change over a specified number of sampling iterations (rate is an mca param), then the errmgr is notified. 2. memory usage - checks amount of memory used by a process. Limit and sampling rate can be set. This support must be enabled by configuring --enable-sensors. ompi_info and orte-info have been updated to include the new framework. Also includes some initial steps toward restoring the recovery capability. Most notably, the ODLS API has been extended to include a "restart_proc" entry for restarting a local process, and organizes the various ERRMGR framework globals into a single struct as we do in the other ORTE frameworks. Fix an oversight in the ERRMGR framework where a pointer array was constructed, but not initialized. Implementation continues. This commit was SVN r23043.	2010-04-26 22:15:57 +00:00
Ralph Castain	43a89bbace	Extend the process and job states by adding values for exceeding sensor bounds. This changes the job state field to 32-bit to also provide room for future expansion. This commit was SVN r23036.	2010-04-26 12:36:40 +00:00
Ralph Castain	9c69900bbc	Ensure we get the jdata object when updating proc state so we don't segv if report-progress is set. Thanks to Sam for reporting it! This commit was SVN r23029.	2010-04-23 15:14:21 +00:00
Ralph Castain	efbb5c9b7c	Revamp the errmgr framework to provide a greater range of optional behaviors, including different behaviors for daemons, and remove several looping messages across the code base: * add hnp and orted modules to the errmgr framework. The HNP module contains much of the code that was in the errmgr base since that code could only be executed by the HNP anyway. * update the odls to report process states directly into the active errmgr module, thus removing the need to send messages looped back into the odls cmd processor. Let the active errmgr module decide what to do at various states. * remove the code to track application state progress from the plm_base_launch_support.c code. Update the plm modules to call the errmgr directly when a launch fails. * update the plm_base_receive.c code to call the errmgr with state updates from remote daemons * update the routed modules to reflect that process state is updated in the errmgr * ensure that the orted's open the errmgr and select their appropriate module * add new pretty-print utilities to print process and job state. Move the pretty-print of time info to a globally-accessible place * define a global orte_comm function to send messages from orted's to the HNP so that others can overlay the standard RML methods, if desired. * update the orterun help output to reflect that the "term w/o sync" error message can result from three, not two, scenarios. This commit was SVN r23023.	2010-04-23 04:44:41 +00:00
Ralph Castain	8da781af84	Continue developing support for distributed virtual machines - minor changes to ensure correct jobid gets used and that dvm's can communicate with tools. This commit was SVN r22958.	2010-04-12 22:33:09 +00:00
Ralph Castain	1caba7af2f	Fix a bunch of compiler warnings reported by Jeff. This commit was SVN r22930.	2010-04-03 00:20:19 +00:00
Ralph Castain	f6bfaa76ba	Add some debug output to job_complete. If no session dirs were created, then cannot check for abort file - which wouldn't be created anyway. This commit was SVN r22903.	2010-03-29 23:21:03 +00:00
Josh Hursey	e4f2d03d28	ErrMgr Framework redesign to better support fault tolerance development activities. Explained in more detail in the following RFC: http://www.open-mpi.org/community/lists/devel/2010/03/7589.php This commit was SVN r22872.	2010-03-23 21:28:02 +00:00
Josh Hursey	e9b5162d79	Fix the configure logic for --with-ft so that it properly takes a comma separated list. Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those. The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed. This commit was SVN r22824.	2010-03-12 23:57:50 +00:00
Josh Hursey	b73237c92a	Identify the process sending the update in the verbose message (helps debugging of process control). This commit was SVN r22804.	2010-03-10 00:23:24 +00:00
Shiqing Fan	db747e4390	Remove the old timing parameter and use orte_timing instead. Thanks to Rainer. This commit was SVN r22775.	2010-03-04 15:00:03 +00:00
Ralph Castain	bef06d52bc	Silence compiler warning. This commit was SVN r22747.	2010-03-01 21:04:26 +00:00
Ralph Castain	6c0d7940c7	Add a new MCA param (and corresponding mpirun cmd line option) to output the debugger proctable info after launch. The output is just the job map with the process pid included, so you get a node-by-node list of the process ranks on that node and their pids. Works for initial launch and comm_spawn. xml and non-xml output is available. This commit was SVN r22725.	2010-02-27 08:32:25 +00:00
Shiqing Fan	4a3f42d159	Correctly initialize the CCP command line buffer. This commit was SVN r22721.	2010-02-26 15:53:00 +00:00
Shiqing Fan	7a5a5ce024	Add a global option for inputting the head node name for Windows CCP through the command line. This commit was SVN r22688.	2010-02-23 19:42:51 +00:00
Josh Hursey	a3583b8f57	Fix --bynode option to remember for subsequent jobs where it left off last time. Add a ''map_bynode'' info key to determine if the job to be started by comm_spawn* should be mapped by node or by slot. Default is to map according to the default policy set when the parent job was started. cmr:v1.5.1 This commit was SVN r22564.	2010-02-05 15:37:49 +00:00
Iain Bason	28f03a2d86	Suspend/resume enhancements: Have orte call setpgrp after forking (but before exec) when orte_forward_job_control is set. Then have it send signals to the child's process group. This allows suspending jobs that fork. If a SIGTSTP arrives before the processes have been launched, then record it and suspend them right after launching. This commit was SVN r22557.	2010-02-04 15:47:20 +00:00
Shiqing Fan	bbcf1f71c4	Remove an incorrect callback, which was based on the old source base without WMI. This does no harm on 32-bit Windows, but it seems to cause exceptions on 64-bit Windows sometimes. All this callback does is wait on the given pid, which is actually a remote pid, so it won't work as expected. cmr:v1.4.2 cmr:v1.5 This commit was SVN r22549.	2010-02-04 10:47:28 +00:00
Ralph Castain	f66b6cae23	Enable the boot of an orted "virtual machine". Modify the mapper framework to allow mapping of only daemons. Remove the cm ras module as no longer required. Modify the orted code to always send back node arch info. Remove the "--enable-bootstrap" configure option as this feature will now always be available. This commit was SVN r22480.	2010-01-25 22:25:13 +00:00

327 commits