openmpi

Author	SHA1	Message	Date
Ralph Castain	0597fdd778	Ensure that orte-iof barks when given an unrecognized cmd line option This commit was SVN r20397.	2009-02-02 14:10:54 +00:00
Ralph Castain	b19dc2a4fa	Update mpirun's man page for report-pid and report-uri options This commit was SVN r20396.	2009-02-02 13:49:07 +00:00
Ralph Castain	2966206f58	Fix a race condition in the IOF and add some new user-requested features: 1. fix a race condition whereby a proc's output could trigger an event prior to the other outputs being set up, thus causing the IOF to declare the proc "terminated" too early. This was really rare, but could happen. 2. add a new "timestamp-output" option that timestamps each line of output 3. add a new "output-filename" option that redirects each proc's output to a separate rank-named file. 4. add a new "xterm" option that redirects the output of the specified ranks to a separate xterm window. This commit was SVN r20392.	2009-01-30 22:47:30 +00:00
Rolf vandeVaart	0704b98668	Add the ability to forward SIGTSTP (converted to SIGSTOP) and SIGCONT to the a.outs. By default, they are not forwarded and the behavior remains as it has always been. However, if one runs with --mca orte_forward_job_control 1, then mpirun will catch those two signals and forward them to the orteds which will deliver them to the a.outs. We have had requests for this feature. This commit was SVN r20391.	2009-01-30 18:50:10 +00:00
Ralph Castain	5e6d3ba289	Initial implementation of static ports. Provide an mca param to specify static port ranges to the OOB - can provide any combination of comma-separated values and ranges. Daemons will use the first port in the range, MPI procs will use the other ports in the range assuming that they know their node rank in time and enough ports were specified. NOTE: this capability only works under specific conditions. I will outline more about this in a note to devel as the remainder of the implementation progresses. For now, the only environment where this works is slurm. The linear routed module has also been adjusted to work with static ports so that all messaging flows strictly through the topology, including the initial daemon callback - thus limiting the number of sockets opened by mpirun. This commit was SVN r20390.	2009-01-30 18:31:43 +00:00
Jeff Squyres	bb3d258562	Round up a few places where PATH_MAX was used instead of OMPI_PATH_MAX. Thanks to Andrea Iob for the bug report. This commit was SVN r20360.	2009-01-27 22:57:50 +00:00
Ralph Castain	fd5e15ea58	Since parsing comma-delimited, range-capable options is being used in multiple places, create a new utility that consolidates that code. Have orte-iof use it. This commit was SVN r20346.	2009-01-25 17:16:25 +00:00
Ralph Castain	0435108834	Improve the efficiency of the launch system by changing the outer loop to being over app_context, and adding a flag to the app_context so the daemon can record that "this app is on my node" when decoding the launch msg. If the --wdir option is given, check to see if the user provided a relative path. If so, convert it to an absolute path. This is needed to maintain consistent behavior across environments. Some environments automatically chdir to your current working directory when launching the remote orted, while others (e.g., ssh) don't. This levels the playing field and reduces user surprise. This commit was SVN r20342.	2009-01-25 12:39:24 +00:00
Jeff Squyres	90e69ac6ff	Fix some man page nits noticed by the Debian OMPI maintainers. Thanks Dirk! This commit was SVN r20307.	2009-01-21 18:38:37 +00:00
Josh Hursey	abfc7c6076	Per ticket #1527 orte-restart should be using {{{--default-hostfile}}} instead of {{{--hostfile}}} with app contexts. Thanks to Gregor Dschung for reporting the problem. This commit was SVN r20305.	2009-01-21 14:08:16 +00:00
Jeff Squyres	0c8f8fe1ea	Fix CID 733: remove some dead code (proc_name was set but effectively never used). This commit was SVN r20271.	2009-01-14 18:12:06 +00:00
Jeff Squyres	d1c6f3f89a	* Fix a truckload of Cisco copyrights to be the same as the rest of the code base. * Fix a few misspellings in other copyrights. This commit was SVN r20241.	2009-01-11 02:30:00 +00:00
Jeff Squyres	1bacdef317	Fix CID 1188. Minor issue; just convert to snprintf instead of sprintf. This commit was SVN r20185.	2009-01-03 14:46:46 +00:00
Ralph Castain	bb96474d6e	Per request from Aurelien, make orterun report-pid and report-uri functions work the same as that of ompi-server. Since these are used for ompi-server-like functionality, it makes sense that the report options work the same. Make orte-top take the corresponding input the same way too for consistency. The modified cmd line options are: --report-uri x where x is either '-' for stdout, '+' for stderr, or a filename --report-pid x where x is the same as above For orte-top, you can now provide either a pid or a uri (which allows connection to remote mpiruns), specified either directly or with a "file:x" option as per mpirun's ompi-server option. Note: I did not add a report-pid option to ompi-server as it probably wouldn't be useful - the report-uri option works as well, and allows remote access (which is likely the normal way it would be used). This commit was SVN r20168.	2008-12-24 15:27:46 +00:00
Ralph Castain	7787f84540	Per the earlier RFC and some discussion at the Dec ORTE design meeting, add the ompi-top tool and all its supporting infrastructure. This includes a new OPAL pstat framework and data type, currently with rather weak support for Mac OSX and pretty complete support for Linux. The Sun team promised to add Solaris support as well. Also, per chat with Jeff, modified the Makefile.am's of a few orte tools so that they were consistent in the way we generate the ompi-equivalent cmds. This commit was SVN r20165.	2008-12-22 20:23:05 +00:00
Shiqing Fan	20cea164db	- 3/4 commit for Windows Visual Studio and CCP support: corrections to non-windows files (but within ifdef __WINDOWS__) type casts, event library for windows use win32. in orte runtime, add windows sockets handling and object construction. This commit was SVN r20110.	2008-12-10 21:13:10 +00:00
Shiqing Fan	a5281f0434	- 1/4 commit for Windows Visual Studio and CCP support: CMakeLists and .windows files. In contribs preconfigured and precompiled parts. This commit was SVN r20108.	2008-12-10 20:59:20 +00:00
Ralph Castain	728a24c8ec	After considerable patience and help with debugging/testing from Tim M and Jeff S, return a completed and pretty well tested patch of the IOF to the trunk. This commit includes the previously reverted r20074, r20068, and r20064, as well as changes to fix those commits. Basically, the remaining problem turned out to be: 1. closing stdout/stderr during orte_finalize of mpirun 2. inadvertently setting up a write event on fd = -1 3. devising a scheme to more accurately track when the stdin write event was active vs closed so it only got released once This passed prelim MTT testing by Jeff and Tim, but should soak for awhile before migrating to 1.3. This commit was SVN r20106. The following SVN revision numbers were found above: r20064 --> open-mpi/ompi@a07660aea8 r20068 --> open-mpi/ompi@ec930d14a9 r20074 --> open-mpi/ompi@2940309613	2008-12-10 20:40:47 +00:00
Ralph Castain	7e3ddb09d3	As requested by Aurelien at the July design meeting - long time coming, but finally got around to it. Enable one mpirun to act as the server for another mpirun when doing MPI_Publish_name and its associated operations. The user is responsible, of course, for ensuring that the mpirun acting as a server outlives any mpiruns using it in that capacity. Add a cmd line option to mpirun --report-pid that prints out mpirun's pid. Allow the --ompi-server option to now take pid:# (or PID:#) of the mpirun to be used as the server, and then look that pid up by searching the local mpirun contact infos for it. This commit was SVN r20102.	2008-12-10 17:10:39 +00:00
Ralph Castain	1ace83c470	Enable modex-less launch. Consists of: 1. minor modification to include two new opal MCA params: (a) opal_profile: outputs what components were selected by each framework currently enabled for most, but not all, frameworks (b) opal_profile_file: name of file that contains profile info required for modex 2. introduction of two new tools: (a) ompi-probe: MPI process that simply calls MPI_Init/Finalize with opal_profile set. Also reports back the rml IP address for all interfaces on the node (b) ompi-profiler: uses ompi-probe to create the profile_file, also reports out a summary of what framework components are actually being used to help with configuration options 3. modification of the grpcomm basic component to utilize the profile file in place of the modex where possible 4. modification of orterun so it properly sees opal mca params and handles opal_profile correctly to ensure we don't get its profile 5. similar mod to orted as for orterun 6. addition of new test that calls orte_init followed by calls to grpcomm.barrier This is all completely benign unless actively selected. At the moment, it only supports modex-less launch for openib-based systems. Minor mod to the TCP btl would be required to enable it as well, if people are interested. Similarly, anyone interested in enabling other BTL's for modex-less operation should let me know and I'll give you the magic details. This seems to significantly improve scalability provided the file can be locally located on the nodes. I'm looking at an alternative means of disseminating the info (perhaps in launch message) as an option for removing that constraint. This commit was SVN r20098.	2008-12-09 23:49:02 +00:00
Ralph Castain	e28210d0dc	Revert r20074, r20068, and r20064: remove the IOF proc completion code pending further off-trunk work. This commit was SVN r20089. The following SVN revision numbers were found above: r20064 --> open-mpi/ompi@a07660aea8 r20068 --> open-mpi/ompi@ec930d14a9 r20074 --> open-mpi/ompi@2940309613	2008-12-09 17:11:59 +00:00
Ralph Castain	a07660aea8	Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes: 1. coordination of job completion notification to include a requirement for both waitpid detection AND notification that all iof pipes have been closed by the app 2. change of all IOF read and write events to be non-persistent so they can properly be shutdown and restarted only when required 3. addition of a delay (currently set to 10ms) before restarting the stdin read event. This was required to ensure that the stdout, stderr, and stddiag read events had an opportunity to be serviced in scenarios where large files are attached to stdin. This commit was SVN r20064.	2008-12-03 17:45:42 +00:00
Josh Hursey	d5c38c2601	fix some typos. should be moved to v1.3 This commit was SVN r19964.	2008-11-10 19:05:26 +00:00
Ralph Castain	b700fff1d1	Update the orterun man page - it was missing numerous options and had several out-of-date ones in it This commit was SVN r19888.	2008-11-03 16:44:27 +00:00
Ralph Castain	03257ebb9f	Since orte-iof doesn't yet know how to handle multiple ranks, adjust the help message and the code to correctly handle one specified rank This commit was SVN r19886.	2008-11-03 14:26:01 +00:00
Brad Penoff	d7b0fdfe5c	small fix to compile trunk on FreeBSD 7 This commit was SVN r19817.	2008-10-28 03:44:23 +00:00
Jeff Squyres	b11d13cc05	Silence a trivial compiler warning (pgcc). This commit was SVN r19810.	2008-10-27 14:23:02 +00:00
Ralph Castain	c56cdac379	Finish cleanup of stdin. Set non-stdio file descriptors to non-blocking (thanks to Jeff for catching that one). Handle writes that result in "would have blocked" errno. This commit was SVN r19793.	2008-10-24 01:42:58 +00:00
Ralph Castain	6e5d844c36	Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code: 1. completely and cleanly separates responsibilities between the HNP, orted, and tool components. 2. removes all wireup messaging during launch and shutdown. 3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol. 4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0. 5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none". 6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout. 7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output" This is not intended for the 1.3 release as it is a major change requiring considerable soak time. This commit was SVN r19767.	2008-10-18 00:00:49 +00:00
Ralph Castain	b46d3e766e	Cleanup the plm failed-to-start problem a little - ensure that the event is always defined so we don't have to check when trying to trigger it, thus avoiding potential race conditions. This commit was SVN r19755.	2008-10-16 14:58:32 +00:00
Ralph Castain	48c3de1865	Fix a problem in the plm "failed to start" code observed by Jeff. When we are unable to launch to a specific node because it doesn't exist or is down, the system would hang and/or segv. The reason for the hang was that we were "firing" the orted exit trigger prior to its timer event being defined - thus "locking" that one-shot and preventing it from firing when we actually were ready to use it. The segv was caused by the fact that we don't really know which daemon failed to start (at least, in most cases), so we didn't set a pointer to the aborted proc object. All we really wanted, though, was to ensure that mpirun returned a non-zero exit status, so the fix was to simply return the default error status. This commit was SVN r19754.	2008-10-16 14:21:37 +00:00
Ralph Castain	802d14b130	Default job controls to forward IO. Adjust debugger code to not forward IO unless requested. This commit was SVN r19690.	2008-10-06 17:56:23 +00:00
Ralph Castain	8d1ecdb361	Correct the creation of MPIR_Proctable so that the structs in the array correspond to the order of the ranks. This commit was SVN r19624.	2008-09-24 14:55:46 +00:00
Ralph Castain	e64b79f30f	Modify the --display-map and --display-alloc per note on devel list to reduce info for user understanding. Add --display-devel-map and --display-devel-alloc to display all the detailed info we used to provide - it is only of use/interest to developers anyway and confuses users. This commit was SVN r19608.	2008-09-23 15:46:34 +00:00
Ralph Castain	4e0f34a062	When we hit an error prior to actually launching daemons, it would be nice if orterun didn't bark about daemons failing to launch, mpirun detecting a job failed, etc. Add a new job state to indicate that we never attempted to launch. Flag such a scenario and avoid hitting all the other error messages. This commit was SVN r19366.	2008-08-19 15:19:30 +00:00
Ralph Castain	49745c5f40	Provide a new option that allows a user to leave an ssh session open without getting deluged by ORTE debug output. The new option is --leave-session-attached, with a corresponding MCA param of orte_leave_session_attached. Theoretically, any PLM could use this - but in reality, all of them except rsh/ssh already leave the session attached anyway. This fixes trac:656 - a REALLY old ticket This commit was SVN r19294. The following Trac tickets were found above: Ticket 656 --> https://svn.open-mpi.org/trac/ompi/ticket/656	2008-08-14 18:59:01 +00:00
Ralph Castain	30f37f762d	Enable co-location of debugger daemons during initial launch and when debugging a running job. Provide support for four MPIR extensions that allow specification of debugger daemon executable, argv for the debugger daemon, whether or not to forward debugger daemon IO, and whether or not debugger daemon will piggy-back on ORTE OOB network. Last is not yet implemented. No change in behavior or operation occurs unless (a) the debugger specifically utilizes the extensions and, for co-locate while running, the user specifically enables the capability via an MCA param. Two of the MPIR extensions supported here are used in a widely-used debugger for a large-scale installation. The other two extensions are new and being utilized in prototype work by several debuggers for possible future release. This commit was SVN r19275.	2008-08-13 17:47:24 +00:00
Terry Dontje	f0eec291d0	Added a couple examples and spelling corrections. This commit was SVN r19234.	2008-08-11 12:34:02 +00:00
Jeff Squyres	797ec531aa	Some more work on the man pages: * Make the creation of the build dir for the man pages a bit more robust (thanks to suggestions from Ralf W.). * Only distribute the .Xin files, not the .X man pages themselves. * Make the .X files depend on opal_config.h so that if you re-run configure and change opal_config.h (e.g., a new version), the man pages should get rebuilt. * Man pages are now cleaned with "distclean", not "maintainer-clean". * Fix a typo in opal_crs.7in. * Update make_dist_tarball to update "date" in the VERSION file. * Make make_dist_tarball a bit friendlier to hg checkouts. This commit was SVN r19219.	2008-08-07 19:20:40 +00:00
Ralph Castain	d7da6b3226	Just a minor cleanup of race conditions on trigger events for exit. Close the trigger pipe upon use since it is only a one-shot anyway. This removes the need to destruct the object, leaving the lock available to protect one-time termination routines throughout the life of the program. This commit was SVN r19208.	2008-08-06 21:53:35 +00:00
Ralph Castain	63c33a9c32	Some minor updates to the locking system changes. Remove obsolete locks. Ensure the trigger event objects do not get deconstructed until the very end to avoid possible problems due to race conditions. Route all orted abnormal term tests through the trigger. This commit was SVN r19172.	2008-08-06 11:31:06 +00:00
Ralph Castain	be02211b4f	Modify the wakeup system to make it more Windows-friendly. This allows Shiqing to consolidate the Windows-specific modifications into one location, and generalizes the wakeup procedure in case we hit other system-specific requirements. This needs some soak time to ensure we haven't opened any race conditions. I tried to loop everything in the shutdown procedure through that trigger event call to ensure it all goes through the one-time locks as it did before so that someone hitting ctrl-c when we are already shutting down shouldn't cause problems. Just want to let people use it for awhile to verify. This commit was SVN r19159.	2008-08-05 15:09:29 +00:00
Ralph Castain	7342a6f1da	Per the July technical meeting: During the discussion of MPI-2 functionality, it was pointed out by Aurelien that there was an inherent race condition between startup of ompi-server and mpirun. Specifically, if someone started ompi-server to run in the background as part of a script, and then immediately executed mpirun, it was possible that an MPI proc could attempt to contact the server (or that mpirun could try to read the server's contact file) before the server is running and ready. At that time, we discussed creating a new tool "ompi-wait-server" that would wait for the server to be running, and/or probe to see if it is running and return true/false. However, rather than create yet another tool, it seemed just as effective to add the functionality to mpirun. Thus, this commit creates two new mpirun cmd line flags (hey, you can never have too many!): --wait-for-server : instructs mpirun to ping the server to see if it responds. This causes mpirun to execute an rml.ping to the server's URI with an appropriate timeout interval - if the ping isn't successful, mpirun attempts it again. --server-wait-time xx : sets the ping timeout interval to xx seconds. Note that mpirun will attempt to ping the server twice with this timeout, so we actually wait for twice this time. Default is 10 seconds, which should be plenty of time. This has only lightly been tested. It works if the server is present, and outputs a nice error message if it cannot be contacted. I have not tested the race condition case. This commit was SVN r19152.	2008-08-04 20:29:50 +00:00
Jeff Squyres	017a4acceb	Missed these during r19141 This commit was SVN r19151. The following SVN revision numbers were found above: r19141 --> open-mpi/ompi@b83ee7d82a	2008-08-04 20:10:55 +00:00
Jeff Squyres	b83ee7d82a	* Fix a problem with VPATH builds if the destination directory didn't already exist * s/top_srcdir/top_builddir/ in a bunch of places; left over from the previous man page generation system This commit was SVN r19141.	2008-08-04 15:17:50 +00:00
Ralph Castain	381d10833a	Remove comments in documentation about the "-nw" option to mpirun as this option doesn't exist, and probably never will This commit was SVN r19137.	2008-08-04 14:38:37 +00:00
Dan Lacher	9175da1e02	Putback for all changes to automate man page updates to strings of versions, dates and build names. Fixes trac:1387 Big thanks to Jeff and Brian for help and oversight. This commit was SVN r19120. The following Trac tickets were found above: Ticket 1387 --> https://svn.open-mpi.org/trac/ompi/ticket/1387	2008-08-01 21:14:37 +00:00
Jeff Squyres	4bdc093746	Fixes trac:1361: mainly add new internal MCA parameter that orterun will set when it launches under debuggers using the --debug option. This commit was SVN r19116. The following Trac tickets were found above: Ticket 1361 --> https://svn.open-mpi.org/trac/ompi/ticket/1361	2008-07-31 22:11:46 +00:00
Ralph Castain	a62b2a0150	Per the July technical meeting: Standardize the handling of the orte launch agent option across PLMs. This has been a consistent complaint I have received - each PLM would register its own MCA param to get input on the launch agent for remote nodes (in fact, one or two didn't, but most did). This would then get handled in various and contradictory ways. Some PLMs would accept only a one-word input. Others accepted multi-word args such as "valgrind orted", but then some would error by putting any prefix specified on the cmd line in front of the incorrect argument. For example, while using the rsh launcher, if you specified "valgrind orted" as your launch agent and had "--prefix foo" on your cmd line, you would attempt to execute "ssh foo/valgrind orted" - which obviously wouldn't work. This was all -very- confusing to users, who had to know which PLM was being used so they could even set the right mca param in the first place! And since we don't warn about non-recognized or non-used mca params, half of the time they would wind up not doing what they thought they were telling us to do. To solve this problem, we did the following: 1. removed all mca params from the individual plms for the launch agent 2. added a new mca param "orte_launch_agent" for this purpose. To further simplify for users, this comes with a new cmd line option "--launch-agent" that can take a multi-word string argument. The value of the param defaults to "orted". 3. added a PLM base function that processes the orte_launch_agent value and adds the contents to a provided argv array. This can subsequently be harvested at-will to handle multi-word values 4. modified the PLMs to use this new function. All the PLMs except for the rsh PLM required very minor change - just called the function and moved on. The rsh PLM required much larger changes as - because of the rsh/ssh cmd line limitations - we had to correctly prepend any provided prefix to the correct argv entry. 5. added a new opal_argv_join_range function that allows the caller to "join" argv entries between two specified indices Please let me know of any problems. I tried to make this as clean as possible, but cannot compile all PLMs to ensure all is correct. This commit was SVN r19097.	2008-07-30 18:26:24 +00:00
Ralph Castain	d45d728e8e	Allow debuggers to attach to a running mpirun by -always- setting up the MPIR_Proctable. Only wait for MPIR_Breakpoint and hold MPI procs if we are launching under a debugger. This commit was SVN r19079.	2008-07-29 17:39:16 +00:00
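
Several of the commits listed above (r20346, r20390) refer to options that accept "any combination of comma-separated values and ranges". The sketch below only illustrates that input format. It is not the Open MPI utility itself: the function name `parse_range_list` and its error handling are hypothetical, and it assumes non-negative integers with ranges written as `low-high`.

```python
def parse_range_list(spec):
    """Expand a string such as "0,2-4,7" into a list of integers.

    Hypothetical helper for illustration only; it assumes non-negative
    integers and inclusive ranges written as "low-high".
    """
    values = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries such as a trailing comma
        if "-" in item:
            # A range: expand it inclusively, rejecting descending bounds.
            low_text, high_text = item.split("-", 1)
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"descending range: {item!r}")
            values.extend(range(low, high + 1))
        else:
            # A single value.
            values.append(int(item))
    return values
```

Under these assumptions, a rank list or a static port list could both be expanded by the same helper, which is the kind of consolidation r20346 describes.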


448 commits