openmpi

Author	SHA1	Message	Date
Tim Prins	f9916811ae	Make it so we do not mangle the options the user passes to their executable. Fixes trac:1124 The change also: - cleans up and simplifies the command line processing code - adds an error output if more than one hostfile passed for a single app context - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options This commit was SVN r17750. The following Trac tickets were found above: Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124	2008-03-05 22:12:27 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Ralph Castain	b6196e8a39	When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed. In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean. I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. Meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it. This commit was SVN r16451.	2007-10-15 18:00:30 +00:00
George Bosilca	d658a477af	Update the help file to match the real name of the required argument. This commit was SVN r15762.	2007-08-04 00:35:55 +00:00
Jeff Squyres	64083570f5	Add support for DDT parallel debugger, which required several things: * Making some symbols and types be global (vs. static) in orterun * Adding a "ddt" entry in the MCA parameter orte_base_user_debugger default value * Add support for @executable@, @executable_argv@, and @single_app@ tokens in the orte_base_user_debugger MCA parameter. * Added various error checks and corresponding help messages after finding a debugger in the PATH Fixes trac:1081 This commit was SVN r15323. The following Trac tickets were found above: Ticket 1081 --> https://svn.open-mpi.org/trac/ompi/ticket/1081	2007-07-10 12:53:48 +00:00
Ralph Castain	c774f641fb	Modify orterun to provide more user-friendly reporting on jobs that fail to start This commit was SVN r14496.	2007-04-24 19:19:14 +00:00
Jeff Squyres	8d872b195a	Refs trac:726 Tested this functionality quite a bit more and made some fixes: * Print far fewer help messages * Fix one additional deadlock upon error * Change some ORTE_LOG messages to silent (because they're not errors) * Some code got re-indented, sorry... Discussed and reviewed with Ralph. This commit was SVN r13375. The following Trac tickets were found above: Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726	2007-01-30 23:03:13 +00:00
Ralph Castain	ab5ea61100	Bring over the rest of the ctrl-c fixes. This commit includes: 1. add a "cancel_operation" API to the pls components that allows orterun to demand that an orted operation (e.g., terminate_job) be immediately cancelled and abandoned. 2. changes the pls orted commands from blocking to non-blocking. This allows us to interrupt those operations should an orted be non-responsive. The change also adds an orte_abort_timeout that limits how long orterun will automatically wait for the orteds to respond - if the terminate command, for example, doesn't see orted response within that time, then we print out an appropriate error message and just give up. 3. modifies orterun to allow multiple ctrl-c's to simply abort the program even if the orteds have not responded 4. does some cleanup on the orte-level mca params so that their implementation looks a lot more like that of ompi - makes it easier to maintain. This change also includes the definition of an orte_abort_timeout struct and associated MCA param (can't have too many!) so you can set the time after which orterun gives up on waiting for orteds to respond This needs more testing before migrating to 1.2. This commit was SVN r13304.	2007-01-25 14:17:44 +00:00
Jeff Squyres	8a289cf1cb	Part 1 of the fix for ticket #726. This commit adds logic to orterun to effect the following: * The first time the user hits ctrl-c, we go into the process of killing the ORTE job (this is not new). * While waiting for the job to actually terminate, if the user hits ctrl-c a second time, we print a warning saying "Hey, I'm still trying to kill the job. If you really want me to die immediately, hit ctrl-c again within 1 second." * If the user hits ctrl-c again within 1 second, orterun quits with a warning about how the job may not have actually been killed. Note that none of this logic will really work until the second part of the fix for #726 is also committed (i.e., make pls.terminate_job() non-blocking). So I'm now throwing the ticket over to Ralph for the second part of the fix... Refs trac:726 This commit was SVN r13040. The following Trac tickets were found above: Ticket 726 --> https://svn.open-mpi.org/trac/ompi/ticket/726	2007-01-08 20:25:26 +00:00
Brian Barrett	bc6cec346f	Print out the description of the signal from mpirun when a proc was aborted by a signal if we have strsignal() This commit was SVN r12888.	2006-12-17 20:01:11 +00:00
Ralph Castain	30de73a712	Add a few attributes that are helpful for folks doing things like Eclipse. Also add yet another command-line option to orterun to support one of the new attributes. These include: 1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch" 2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command. 3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job 4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map. Enjoy! My personal favorite is the first one - helps with debugging. This commit was SVN r12379.	2006-10-31 22:16:51 +00:00
Ralph Castain	3f55d6897a	Remove the memory debugging options. Fix what appears to be a typo in a help file. This commit was SVN r12107.	2006-10-12 00:44:48 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00
Ralph Castain	febc143d8c	Per LANL's stated need, add functionality that runs a.out across ALL available process slots if no num_proc is specified on the command line. However, please note the following limitation: we ONLY allow ONE application to be specified on the command line when this feature is invoked. If multiple apps are specified, the user MUST also specify the number to be launched for each and every one of them. Update the help text to report errors when not following that rule. Also updated the RMAPS help text to reflect the reorganization of some of the round-robin code into the base. The new functionality has been tested under Mac OS-X and on Odin using an MPI program. Both byslot and bynode mapping have been checked and verified. Operational support for other systems needs to be verified - I respectfully request people's help in doing so. This commit was SVN r10708.	2006-07-10 21:25:33 +00:00
Brian Barrett	9766c01e50	* Per discussion at quarterly meeting and bug #91, print out the bug contact point when printing version and help strings This commit was SVN r10484.	2006-06-22 19:48:27 +00:00
Brian Barrett	5c89dc6946	Fix for ticket #91 mpirun/orterun now has an option to print the version number. If -V/--version is given, it will print the version number. If it's the only option, we exit cleanly. Otherwise, we continue on as if --version wasn't given (except we've printed the version number). --This line, and those below, will be ignored-- M orte/tools/orterun/orterun.c M orte/tools/orterun/help-orterun.txt This commit was SVN r10276.	2006-06-09 17:21:23 +00:00
Jeff Squyres	c2c2daa966	Change the behavior of orterun (mpirun, mpiexec) to search for argv[0] and the cwd on the target node (i.e., the node where the executable will be running in all systems except BProc, where the searches are run on the node where orterun is invoked). - fork pls now does cwd and argv[0] search in orted - bproc pls does cwd and argv[0] search in orterun - cwd behavior slightly different: - if user specifies a -wdir to orterun, we chdir() to there; if we can't for some reason, abort - if user does not specify a -wdir, try to chdir() to the dir where orterun was invoked. If we can't for some reason (e.g., it doesn't exist on the target node), then try to chdir($HOME). If we can't do that, then just live with whatever default directory we were put in. This commit was SVN r9068.	2006-02-16 20:40:23 +00:00
Jeff Squyres	8d96c21311	Good weekend brainless activity -- implement the orterun command line debugger scheme described in http://www.open-mpi.org/community/lists/users/2005/10/0214.php. This makes our user-level debugger scheme much more vendor-independent (although the "-tv" option will still work for backwards compatibility -- it'll just be a synonym of "--debug"). This commit was SVN r8206.	2005-11-20 16:06:53 +00:00
Jeff Squyres	42ec26e640	Update the copyright notices for IU and UTK. This commit was SVN r7999.	2005-11-05 19:57:48 +00:00
Jeff Squyres	65f1adfedc	Add "-tv" option to orterun: orterun -tv -np 4 foo which will turn around and re-exec: totalview orterun -a -np 4 foo This commit was SVN r7636.	2005-10-05 10:24:34 +00:00
Josh Hursey	50e128ab83	Take out the --map command line argument, since it is not handled properly at the moment. Also remove all references to --map, and (C, N) command line options in the help file. These references will be put back in when these options are implemented. This commit was SVN r7574.	2005-10-01 15:51:20 +00:00
Jeff Squyres	383d9f58e7	Be [slightly] more descriptive. :-) This commit was SVN r7198.	2005-09-06 16:57:11 +00:00
Rainer Keller	a36347d728	- Support -prefix specification on mpirun/orterun cmd-line per app_context: mpirun -np 2 -prefix /path/to/ompi/on/machineA ./exec1 : \ -np 2 -prefix /path/to/ompi/on/machineB ./exec2 - Allow with -mca pls_rsh_assume_same_shell 0, the checking for the SHELL-variable on the actual node (currently 1st node). Sets the prefix, PATH and LD_LIBRARY_PATH for bash/ksh and csh/tcsh. This commit was SVN r7195.	2005-09-06 16:10:05 +00:00
Jeff Squyres	b3bd549331	- Change a few calls from exit() to orte_abort() so that we get session directory cleanup (among other things) - When we get an abnormal exit in orterun (i.e., timeout expires and we haven't gotten termination notices from all processes), print a better message and exit in a better way (which includes session directory cleanup) - Fix tm and poe pls's to not exit() but rather propagate the error up the stack (where relevant) This commit was SVN r7058.	2005-08-26 20:36:11 +00:00
Josh Hursey	018c4aa44e	remove unnecessary slashes This commit was SVN r6673.	2005-07-28 21:33:33 +00:00
Josh Hursey	8b56769307	removed the version command line option. Added some more user help messages This commit was SVN r6672.	2005-07-28 21:17:48 +00:00
Jeff Squyres	1b18979f79	Initial population of orte tree This commit was SVN r6266.	2005-07-02 13:42:54 +00:00

27 commits