openmpi

Author	SHA1	Message	Date
Ralph Castain	30f37f762d	Enable co-location of debugger daemons during initial launch and when debugging a running job. Provide support for four MPIR extensions that allow specification of debugger daemon executable, argv for the debugger daemon, whether or not to forward debugger daemon IO, and whether or not debugger daemon will piggy-back on ORTE OOB network. Last is not yet implemented. No change in behavior or operation occurs unless (a) the debugger specifically utilizes the extensions and, for co-locate while running, the user specifically enables the capability via an MCA param. Two of the MPIR extensions supported here are used in a widely-used debugger for a large-scale installation. The other two extensions are new and being utilized in prototype work by several debuggers for possible future release. This commit was SVN r19275.	2008-08-13 17:47:24 +00:00
Terry Dontje	f0eec291d0	Added a couple examples and spelling corrections. This commit was SVN r19234.	2008-08-11 12:34:02 +00:00
Jeff Squyres	797ec531aa	Some more work on the man pages: * Make the creation of the build dir for the man pages a bit more robust (thanks to suggestions from Ralf W.). * Only distribute the .Xin files, not the .X man pages themselves. * Make the .X files depend on opal_config.h so that if you re-run configure and change opal_config.h (e.g., a new version), the man pages should get rebuilt. * Man pages are now cleaned with "distclean", not "maintainer-clean". * Fix a typo in opal_crs.7in. * Update make_dist_tarball to update "date" in the VERSION file. * Make make_dist_tarball a bit friendlier to hg checkouts. This commit was SVN r19219.	2008-08-07 19:20:40 +00:00
Ralph Castain	d7da6b3226	Just a minor cleanup of race conditions on trigger events for exit. Close the trigger pipe upon use since it is only a one-shot anyway. This removes the need to destruct the object, leaving the lock available to protect one-time termination routines throughout the life of the program. This commit was SVN r19208.	2008-08-06 21:53:35 +00:00
Ralph Castain	63c33a9c32	Some minor updates to the locking system changes. Remove obsolete locks. Ensure the trigger event objects do not get deconstructed until the very end to avoid possible problems due to race conditions. Route all orted abnormal term tests through the trigger. This commit was SVN r19172.	2008-08-06 11:31:06 +00:00
Ralph Castain	be02211b4f	Modify the wakeup system to make it more Windows-friendly. This allows Shiqing to consolidate the Windows-specific modifications into one location, and generalizes the wakeup procedure in case we hit other system-specific requirements. This needs some soak time to ensure we haven't opened any race conditions. I tried to loop everything in the shutdown procedure through that trigger event call to ensure it all goes through the one-time locks as it did before so that someone hitting ctrl-c when we are already shutting down shouldn't cause problems. Just want to let people use it for awhile to verify. This commit was SVN r19159.	2008-08-05 15:09:29 +00:00
Ralph Castain	7342a6f1da	Per the July technical meeting: During the discussion of MPI-2 functionality, it was pointed out by Aurelien that there was an inherent race condition between startup of ompi-server and mpirun. Specifically, if someone started ompi-server to run in the background as part of a script, and then immediately executed mpirun, it was possible that an MPI proc could attempt to contact the server (or that mpirun could try to read the server's contact file) before the server is running and ready. At that time, we discussed creating a new tool "ompi-wait-server" that would wait for the server to be running, and/or probe to see if it is running and return true/false. However, rather than create yet another tool, it seemed just as effective to add the functionality to mpirun. Thus, this commit creates two new mpirun cmd line flags (hey, you can never have too many!): --wait-for-server : instructs mpirun to ping the server to see if it responds. This causes mpirun to execute an rml.ping to the server's URI with an appropriate timeout interval - if the ping isn't successful, mpirun attempts it again. --server-wait-time xx : sets the ping timeout interval to xx seconds. Note that mpirun will attempt to ping the server twice with this timeout, so we actually wait for twice this time. Default is 10 seconds, which should be plenty of time. This has only lightly been tested. It works if the server is present, and outputs a nice error message if it cannot be contacted. I have not tested the race condition case. This commit was SVN r19152.	2008-08-04 20:29:50 +00:00
Jeff Squyres	017a4acceb	Missed these during r19141 This commit was SVN r19151. The following SVN revision numbers were found above: r19141 --> open-mpi/ompi@b83ee7d82a	2008-08-04 20:10:55 +00:00
Jeff Squyres	b83ee7d82a	* Fix a problem with VPATH builds if the destination directory didn't already exist * s/top_srcdir/top_builddir/ in a bunch of places; left over from the previous man page generation system This commit was SVN r19141.	2008-08-04 15:17:50 +00:00
Ralph Castain	381d10833a	Remove comments in documentation about the "-nw" option to mpirun as this option doesn't exist, and probably never will This commit was SVN r19137.	2008-08-04 14:38:37 +00:00
Dan Lacher	9175da1e02	Putback for all changes to automate man page updates to strings of versions, dates and build names. Fixes trac:1387 Big thanks to Jeff and Brian for help and oversight. This commit was SVN r19120. The following Trac tickets were found above: Ticket 1387 --> https://svn.open-mpi.org/trac/ompi/ticket/1387	2008-08-01 21:14:37 +00:00
Jeff Squyres	4bdc093746	Fixes trac:1361: mainly add new internal MCA parameter that orterun will set when it launches under debuggers using the --debug option. This commit was SVN r19116. The following Trac tickets were found above: Ticket 1361 --> https://svn.open-mpi.org/trac/ompi/ticket/1361	2008-07-31 22:11:46 +00:00
Ralph Castain	a62b2a0150	Per the July technical meeting: Standardize the handling of the orte launch agent option across PLMs. This has been a consistent complaint I have received - each PLM would register its own MCA param to get input on the launch agent for remote nodes (in fact, one or two didn't, but most did). This would then get handled in various and contradictory ways. Some PLMs would accept only a one-word input. Others accepted multi-word args such as "valgrind orted", but then some would error by putting any prefix specified on the cmd line in front of the incorrect argument. For example, while using the rsh launcher, if you specified "valgrind orted" as your launch agent and had "--prefix foo" on your cmd line, you would attempt to execute "ssh foo/valgrind orted" - which obviously wouldn't work. This was all -very- confusing to users, who had to know which PLM was being used so they could even set the right mca param in the first place! And since we don't warn about non-recognized or non-used mca params, half of the time they would wind up not doing what they thought they were telling us to do. To solve this problem, we did the following: 1. removed all mca params from the individual plms for the launch agent 2. added a new mca param "orte_launch_agent" for this purpose. To further simplify for users, this comes with a new cmd line option "--launch-agent" that can take a multi-word string argument. The value of the param defaults to "orted". 3. added a PLM base function that processes the orte_launch_agent value and adds the contents to a provided argv array. This can subsequently be harvested at-will to handle multi-word values 4. modified the PLMs to use this new function. All the PLMs except for the rsh PLM required very minor change - just called the function and moved on. The rsh PLM required much larger changes as - because of the rsh/ssh cmd line limitations - we had to correctly prepend any provided prefix to the correct argv entry. 5. added a new opal_argv_join_range function that allows the caller to "join" argv entries between two specified indices Please let me know of any problems. I tried to make this as clean as possible, but cannot compile all PLMs to ensure all is correct. This commit was SVN r19097.	2008-07-30 18:26:24 +00:00
Ralph Castain	d45d728e8e	Allow debuggers to attach to a running mpirun by -always- setting up the MPIR_Proctable. Only wait for MPIR_Breakpoint and hold MPI procs if we are launching under a debugger. This commit was SVN r19079.	2008-07-29 17:39:16 +00:00
Ralph Castain	3107545709	Ensure that ORTE processes such as mpirun and orted never inadvertently bind themselves to cores. Change the mca param name used by the rank_file mapper to get user directives on slot lists to be different from that used by MPI procs to discover their binding. Add a cmd line option to orterun to make it easier for a user to specify the slot list (basically, hide the mca param name). Discussed and reviewed with Lenny and Jeff. This commit was SVN r19062.	2008-07-28 14:18:36 +00:00
Rolf vandeVaart	ed4920ba5f	Fix a couple problems with orte-clean. Also add a new --debug flag to help developers figure out possible future issues. This fixes trac:1335. This commit was SVN r18979. The following Trac tickets were found above: Ticket 1335 --> https://svn.open-mpi.org/trac/ompi/ticket/1335	2008-07-22 17:41:06 +00:00
Rolf vandeVaart	84ede78e40	Clarify use of --path option to mpirun. This commit fixes trac:1231. This commit was SVN r18909. The following Trac tickets were found above: Ticket 1231 --> https://svn.open-mpi.org/trac/ompi/ticket/1231	2008-07-14 21:27:23 +00:00
Ralph Castain	1edcfa970e	Remove debug This commit was SVN r18845.	2008-07-09 00:12:11 +00:00
Ralph Castain	f1114b4144	Upgrade the ability of orterun to deal with cmd line MCA params that are passed to the orteds. Help reduce the size of the cmd line by eliminating duplicates where possible, and alert to duplicate entries that can cause problems. Add comments to both orterun and orted code explaining why we take a snapshot of the local environment and apply it to the local procs when they are spawned. This commit was SVN r18842.	2008-07-08 22:36:39 +00:00
Ralph Castain	51da9f2980	Properly ensure that cmd line MCA params override environmental MCA params for procs local to mpirun. Actually, the problem was that we were simply -adding- any enviro MCA params to whatever had been found on the cmd line. Thus, duplicate MCA param directives were winding up duplicated in the environment. Some shells took the first one in the environ array - others took the last! So we could get completely different behavior based on the whims of the shell. This commit fixes trac:1373 This commit was SVN r18836. The following Trac tickets were found above: Ticket 1373 --> https://svn.open-mpi.org/trac/ompi/ticket/1373	2008-07-08 13:48:47 +00:00
Lenny Verkhovsky	23da11fdcc	add some documentation for mpirun's new --loadbalance option closing #1277 This commit was SVN r18832.	2008-07-08 07:56:54 +00:00
Lenny Verkhovsky	489c22b6b1	Added -rf|--rankfile <arg0> option to mpirun -h + man info This commit was SVN r18812.	2008-07-07 13:46:22 +00:00
Lenny Verkhovsky	2fae770a86	information about carto framework in mpirun man page, ticket #1372 This commit was SVN r18809.	2008-07-07 12:02:10 +00:00
Ralph Castain	ba5498cdc6	Repair the MPI-2 dynamic operations. This includes: 1. repair of the linear and direct routed modules 2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI 3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature 4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes. Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB 5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept. Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules. 6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch 7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch 8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies 9. minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc. 10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon There is still more cleanup to be done for efficiency, but this at least works. Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn. Fixes ticket #1256 This commit was SVN r18804.	2008-07-03 17:53:37 +00:00
Ralph Castain	b118779c08	It is okay for us to init the ORTE mca params multiple times. Indeed, it is absolutely required by orterun as the first time has to be done prior to parsing the command line, which means that the mca values haven't been parsed yet! Add ability for sys admins to prohibit putting session directories under specified locations. Thus, they can now protect parallel file systems from foolish user mistakes. This commit was SVN r18721.	2008-06-24 17:50:56 +00:00
Jeff Squyres	24c3aa1d77	Really fix "make dist". Really. This commit was SVN r18704.	2008-06-21 18:04:38 +00:00
Jeff Squyres	930667ac73	Ensure that orte-checkpoint and orte-restart man pages are always included in the distribution tarball. This ''appears'' to be an Automake bug -- I have submitted a bug report to the bug-automake list: http://lists.gnu.org/archive/html/bug-automake/2008-06/msg00019.html This commit was SVN r18696.	2008-06-20 18:19:01 +00:00
Jeff Squyres	e4172a3c44	Shift the AM "if" logic down from orte/tools/Makefile.am to the individual orte/tools/*/Makefile.am files. This causes "make" to traverse into every directory, even if it's not going to build anything in that directory (which is a good thing). It also helps cleanup and dist issues. This also affects orte-checkpoint and orte-restart, but I couldn't get --with-ft to compile properly; I'll pass along a heads-up to Josh to ensure that I didn't break anything. This commit was SVN r18680.	2008-06-19 14:46:10 +00:00
Ralph Castain	571f483c39	Ensure that we don't breakpoint the debugger until -after- all procs have reported their contact info so we can successfully send the release message This commit was SVN r18678.	2008-06-19 14:37:46 +00:00
Jeff Squyres	7e45b24001	MPIR_being_debugged is an int, not a bool. This commit was SVN r18676.	2008-06-19 13:31:34 +00:00
Ralph Castain	b56f8ced4f	Ensure params are registered prior to parsing global cmd line options in orterun so that debugger options are properly captured and acted upon. Ensure that routes to remote procs are set on the HNP before completing launch so that the debugger message can be sent. Solves a race condition that can exist in those environments where the HNP does not have local procs. This commit was SVN r18674.	2008-06-19 02:58:14 +00:00
Ralph Castain	282a220e7e	Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation This commit was SVN r18670.	2008-06-18 15:28:46 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Brian Barrett	7712b07ac4	Add perl based wrapper compilers for cross-compile environments. The default is still to use the C based wrapper compilers (which have many more features and are more well tested). The Perl compilers are enabled with the option --enable-script-wrapper-compilers, which also ignores the option --disable-binaries (ie --enable-script-wrapper-compilers --disable-binaries will result in perl-based wrapper compilers being installed, but no other binaries being installed). This commit was SVN r18655.	2008-06-13 22:52:25 +00:00
Ralph Castain	1f41069ac9	Fix CID 752 - if we can't find the daemon job object, we have to ensure we exit without attempting to dereference it This commit was SVN r18647.	2008-06-11 14:49:58 +00:00
Ralph Castain	13ea4e4673	Be consistent - since we don't strdup the other values for param, don't strdup this one. This allows r18645 to fix the memory corruption issue, but also allows us to resolve the memory leaks cited by CID 1039 This commit was SVN r18646. The following SVN revision numbers were found above: r18645 --> open-mpi/ompi@53d83ba1c5	2008-06-11 14:42:47 +00:00
Pak Lui	53d83ba1c5	Take out a couple of free's. This commit fixes trac:1343 This commit was SVN r18645. The following Trac tickets were found above: Ticket 1343 --> https://svn.open-mpi.org/trac/ompi/ticket/1343	2008-06-11 14:02:49 +00:00
Ralph Castain	1a422995ae	Fix two Coverity complaints CID 813 (value defined and not used) and 1039 (resource leak). While doing so, found and fixed another less obvious memory leak. This commit was SVN r18641.	2008-06-10 17:53:28 +00:00
Ralph Castain	c13cadc3c7	Refs trac:1255 This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something. So the ticket can be checked by Jeff when he returns from vacation... :-) This commit was SVN r18625. The following Trac tickets were found above: Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255	2008-06-09 20:34:14 +00:00
Ralph Castain	2cc8b2c51f	Add yet another test, this one for proper error behavior when someone calls an MPI function after calling MPI_Finalize. Add a minor debug that outputs the orterun exit status to stderr when orte_debug is set. This commit was SVN r18622.	2008-06-09 19:21:20 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Ralph Castain	83dd3d8c6f	Restore the ability to forcibly terminate by providing multiple ctrl-c's This commit was SVN r18618.	2008-06-09 13:08:54 +00:00
Ralph Castain	0da811ce79	Initial work on xml support - allocation and job map outputs completed. More to come. This commit was SVN r18587.	2008-06-04 20:53:12 +00:00
Josh Hursey	78f14b5255	Fix the none.checkpoint command. orte-checkpoint/orte-restart do not seem to totally like orte_output so revert them to opal_output for now. Since we have no need for the additional complexity of orte_output we can drop it for now and revisit this if anyone needs it later. It seems that if you set the verbose level on an output handle then try to call a normal orte_output() on it then the message will not be printed. This is the same for opal_output, and seems incorrect to me because it stops some error messages from being printed out if you do not directly specify opal_output(0, ...). Maybe someone should take a look at this. orte-checkpoint would segv if passed an incorrect PID. Fixed the return code so it errors out properly. Thanks to Eric Roman for bringing this to my attention. This commit was SVN r18583.	2008-06-04 14:44:11 +00:00
Jeff Squyres	a4db97c213	More man pages fixes from our Debian Open MPI package maintainer friends. Woo hoo! This commit was SVN r18559.	2008-06-03 16:44:40 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Ralph Castain	b456fb2d42	Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. Heart rate is set via mca param. The HNP checks for callback every 2*heartrate, declares orted failure if not seen in last 2*heartrate time. Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher. Neither detection is enabled by default, but are only active if heartrate is set and/or launch timeout is set. Exception for SLURM as orted failure is always detected and reported. More info to come on devel list. This commit was SVN r18555.	2008-06-02 21:46:34 +00:00
Ralph Castain	2772221ae0	Add the -xml option to mpirun to indicate that xml output is desired This commit was SVN r18539.	2008-05-29 14:11:31 +00:00
Ralph Castain	72530f8fed	Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and does a best-effort attempt to cleanup the mess. However, it still has two unresolved issues that need to be eventually addressed: 1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done. 2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear. This represents the best-available fix for ticket #221 so I am closing that ticket at this time. This commit was SVN r18536.	2008-05-29 13:38:27 +00:00
Jeff Squyres	af1a9290cb	Ensure to check that opal_init_util() didn't fail. This commit was SVN r18453.	2008-05-19 11:58:48 +00:00


412 Commits