openmpi

Author	SHA1	Message	Date
Ralph Castain	f326ee356e	Add some error output to the plm rsh This commit was SVN r19532.	2008-09-10 01:59:49 +00:00
Ralph Castain	9b8473fdbf	Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. Cleanup when an orted actually opens the PLM. Unfortunately, some unmentionable people are pushing head node environs out to remote nodes, causing the daemons to think they are the HNP. This helps prevent the confusion. This commit was SVN r19518.	2008-09-08 15:45:11 +00:00
Shiqing Fan	cd6ff74d89	Update the ccp module: rename the get_cluster_message function for both ras/plm. use _umask instead of umask. add WIN32_DCOM definition to support Windows Vista. This commit was SVN r19470.	2008-09-01 16:35:38 +00:00
Ralph Castain	43f8bcfe54	Update slurm plm to respect leave_session_attached This commit was SVN r19370.	2008-08-19 18:30:30 +00:00
Ralph Castain	4e0f34a062	When we hit an error prior to actually launching daemons, it would be nice if orterun didn't bark about daemons failing to launch, mpirun detecting a job failed, etc. Add a new job state to indicate that we never attempted to launch. Flag such a scenario and avoid hitting all the other error messages. This commit was SVN r19366.	2008-08-19 15:19:30 +00:00
Ralph Castain	49745c5f40	Provide a new option that allows a user to leave an ssh session open without getting deluged by ORTE debug output. The new option is --leave-session-attached, with a corresponding MCA param of orte_leave_session_attached. Theoretically, any PLM could use this - but in reality, all of them except rsh/ssh already leave the session attached anyway. This fixes trac:656 - a REALLY old ticket This commit was SVN r19294. The following Trac tickets were found above: Ticket 656 --> https://svn.open-mpi.org/trac/ompi/ticket/656	2008-08-14 18:59:01 +00:00
Ralph Castain	30f37f762d	Enable co-location of debugger daemons during initial launch and when debugging a running job. Provide support for four MPIR extensions that allow specification of debugger daemon executable, argv for the debugger daemon, whether or not to forward debugger daemon IO, and whether or not debugger daemon will piggy-back on ORTE OOB network. Last is not yet implemented. No change in behavior or operation occurs unless (a) the debugger specifically utilizes the extensions and, for co-locate while running, the user specifically enables the capability via an MCA param. Two of the MPIR extensions supported here are used in a widely-used debugger for a large-scale installation. The other two extensions are new and being utilized in prototype work by several debuggers for possible future release. This commit was SVN r19275.	2008-08-13 17:47:24 +00:00
Ralph Castain	f017c55bfa	Close a minor memory leak - we can reuse timer events This commit was SVN r19251.	2008-08-12 12:53:30 +00:00
Ralph Castain	be02211b4f	Modify the wakeup system to make it more Windows-friendly. This allows Shiqing to consolidate the Windows-specific modifications into one location, and generalizes the wakeup procedure in case we hit other system-specific requirements. This needs some soak time to ensure we haven't opened any race conditions. I tried to loop everything in the shutdown procedure through that trigger event call to ensure it all goes through the one-time locks as it did before so that someone hitting ctrl-c when we are already shutting down shouldn't cause problems. Just want to let people use it for awhile to verify. This commit was SVN r19159.	2008-08-05 15:09:29 +00:00
Ralph Castain	5b2f53a069	One more quick fix - ensure we are looking at the value and not its pointer This commit was SVN r19123.	2008-08-01 23:39:55 +00:00
Jeff Squyres	26c7daf16a	Fix typo This commit was SVN r19121.	2008-08-01 21:30:53 +00:00
Ralph Castain	21cd4b9df8	Add pls_rsh_agent synonym to the PLM rsh component This commit was SVN r19119.	2008-08-01 20:15:42 +00:00
Ralph Castain	a62b2a0150	Per the July technical meeting: Standardize the handling of the orte launch agent option across PLMs. This has been a consistent complaint I have received - each PLM would register its own MCA param to get input on the launch agent for remote nodes (in fact, one or two didn't, but most did). This would then get handled in various and contradictory ways. Some PLMs would accept only a one-word input. Others accepted multi-word args such as "valgrind orted", but then some would error by putting any prefix specified on the cmd line in front of the incorrect argument. For example, while using the rsh launcher, if you specified "valgrind orted" as your launch agent and had "--prefix foo" on your cmd line, you would attempt to execute "ssh foo/valgrind orted" - which obviously wouldn't work. This was all -very- confusing to users, who had to know which PLM was being used so they could even set the right mca param in the first place! And since we don't warn about non-recognized or non-used mca params, half of the time they would wind up not doing what they thought they were telling us to do. To solve this problem, we did the following: 1. removed all mca params from the individual plms for the launch agent 2. added a new mca param "orte_launch_agent" for this purpose. To further simplify for users, this comes with a new cmd line option "--launch-agent" that can take a multi-word string argument. The value of the param defaults to "orted". 3. added a PLM base function that processes the orte_launch_agent value and adds the contents to a provided argv array. This can subsequently be harvested at-will to handle multi-word values 4. modified the PLMs to use this new function. All the PLMs except for the rsh PLM required very minor change - just called the function and moved on. The rsh PLM required much larger changes as - because of the rsh/ssh cmd line limitations - we had to correctly prepend any provided prefix to the correct argv entry. 5. added a new opal_argv_join_range function that allows the caller to "join" argv entries between two specified indices Please let me know of any problems. I tried to make this as clean as possible, but cannot compile all PLMs to ensure all is correct. This commit was SVN r19097.	2008-07-30 18:26:24 +00:00
George Bosilca	a4d905db4a	Allow xgrid to compile. This commit was SVN r19076.	2008-07-29 13:24:08 +00:00
Jeff Squyres	0af7ac53f2	Fixes trac:1392, #1400 * add "register" function to mca_base_component_t * converted coll:basic and paffinity:linux and paffinity:solaris to use this function * we'll convert the rest over time (I'll file a ticket once all this is committed) * add 32 bytes of "reserved" space to the end of mca_base_component_t and mca_base_component_data_2_0_0_t to make future upgrades [slightly] easier * new mca_base_component_t size: 196 bytes * new mca_base_component_data_2_0_0_t size: 36 bytes * MCA base version bumped to v2.0 * '''We now refuse to load components that are not MCA v2.0.x''' * all MCA frameworks versions bumped to v2.0 * be a little more explicit about version numbers in the MCA base * add big comment in mca.h about versioning philosophy This commit was SVN r19073. The following Trac tickets were found above: Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392	2008-07-28 22:40:57 +00:00
Thomas Herault	28dc80b67e	Deal with the SIGCHLD issue in LSF. lsb_launch tampers with SIGCHLD signal handler. We are forced to reinstall our own signal handler after a call to this function. This commit fixes trac:1356. This commit was SVN r19033. The following Trac tickets were found above: Ticket 1356 --> https://svn.open-mpi.org/trac/ompi/ticket/1356	2008-07-25 15:23:23 +00:00
Thomas Herault	b6affd35e9	Small typos for LSF compilation and update Makefile.am This commit was SVN r18998.	2008-07-23 14:42:26 +00:00
George Bosilca	bcac9a0540	Remove a warning about using map when it is not initialized. This commit was SVN r18957.	2008-07-21 14:35:05 +00:00
Ralph Castain	b1f367563c	Few minor code cleanups This commit was SVN r18890.	2008-07-11 15:40:41 +00:00
Ralph Castain	58964b2bf8	Make lsf support compile This commit was SVN r18889.	2008-07-11 15:40:25 +00:00
Shiqing Fan	67a842fd17	Too many actual parameters for this function, remove the wrong one in order to get rid of the compiler warnings. This commit was SVN r18863.	2008-07-10 08:06:53 +00:00
Pak Lui	924bface15	The plm env var should be set to the name of a current plm module, which is rsh. This commit was SVN r18844.	2008-07-08 23:15:52 +00:00
Shiqing Fan	5d0f4dc88d	- Clean up the unreferenced variables. - Change the arguments for launch failed function according to changeset r18611. This commit was SVN r18795. The following SVN revision numbers were found above: r18611 --> open-mpi/ompi@7bee71aa59	2008-07-03 10:11:08 +00:00
Shiqing Fan	a3e1718126	Missing one argument for calling this function. This commit was SVN r18793.	2008-07-01 18:01:22 +00:00
Ralph Castain	9cebe0ca96	Ckpt the bproc support. All compiles now except for PLM module This commit was SVN r18744.	2008-06-26 03:48:22 +00:00
Ralph Castain	17fcd72b5d	Restore bproc code - if someone wants to maintain it, then more power to them...but it would definitely be easier if the old code is in the trunk. This is all .ompi_ignore'd except for me so I can play with making it compile again in my copious free time. This commit was SVN r18716.	2008-06-24 01:27:22 +00:00
Ralph Castain	3e61a3f92e	Sandbox for next-gen launch This commit was SVN r18715.	2008-06-24 01:25:51 +00:00
Ralph Castain	f799ea225f	Orterun creates a "clean" copy of its environment for use in launching procs. This includes properly setting LD_LIBRARY_PATH and PATH, among other things. Unfortunately, our PLM modules were using the local environ instead of the saved copy, thus missing a number of things that really should have been included. From what I see, we got away with the error because the PLM's were duplicating all that setup logic themselves - I'll clean this up over the next few days. Meantime, correct the PLM's so they use the correct environ for launching. This commit was SVN r18713.	2008-06-23 22:39:36 +00:00
Ralph Castain	b56f8ced4f	Ensure params are registered prior to parsing global cmd line options in orterun so that debugger options are properly captured and acted upon. Ensure that routes to remote procs are set on the HNP before completing launch so that the debugger message can be sent. Solves a race condition that can exist in those environments where the HNP does not have local procs. This commit was SVN r18674.	2008-06-19 02:58:14 +00:00
Ralph Castain	282a220e7e	Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation This commit was SVN r18670.	2008-06-18 15:28:46 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Ralph Castain	1a422995ae	Fix two Coverity complaints CID 813 (value defined and not used) and 1039 (resource leak). While doing so, found and fixed another less obvious memory leak. This commit was SVN r18641.	2008-06-10 17:53:28 +00:00
George Bosilca	f72ab90b16	Allow xgrid to compile again. This commit was SVN r18631.	2008-06-09 21:51:41 +00:00
Ralph Castain	c13cadc3c7	Refs trac:1255 This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something. So the ticket can be checked by Jeff when he returns from vacation... :-) This commit was SVN r18625. The following Trac tickets were found above: Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255	2008-06-09 20:34:14 +00:00
Ralph Castain	bf5c34d10a	The rsh launcher is one place where multi-word MCA params would have to be passed via the orted cmd line. In such a case, we have to explicitly include quote marks about the param value. Add that capability here. This commit fixes trac:1200 This commit was SVN r18621. The following Trac tickets were found above: Ticket 1200 --> https://svn.open-mpi.org/trac/ompi/ticket/1200	2008-06-09 19:07:19 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Pak Lui	caac0e0182	Add in a couple missing ones from r18611 for all tm users out there... This commit was SVN r18615. The following SVN revision numbers were found above: r18611 --> open-mpi/ompi@7bee71aa59	2008-06-06 22:53:43 +00:00
Ralph Castain	b65eb54ea2	Cut out a new iof pull - that capability isn't ready yet for the trunk, but will be coming shortly Thanks to Pak for letting me know... This commit was SVN r18614.	2008-06-06 21:24:15 +00:00
Ralph Castain	7bee71aa59	Fix a potential, albeit perhaps esoteric, race condition that can occur for fast HNP's, slow orteds, and fast apps. Under those conditions, it is possible for the orted to be caught in its original send of contact info back to the HNP, and thus for the progress stack never to recover back to a high level. In those circumstances, the orted can "hang" when trying to exit. Add a new function to opal_progress that tells us our recursion depth to support that solution. Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death... Also ensure that we declare "failed" for the daemon job when daemons fail instead of the application job. This is important so that orte knows that it cannot use xcast to tell daemons to "exit", nor should it expect all daemons to respond. Otherwise, it is possible to hang. After lots of testing, decide to default (again) to slurm detecting failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them. Thank you Jeff! :-/ This commit was SVN r18611.	2008-06-06 19:36:27 +00:00
Josh Hursey	1de50b523c	Fix some Coverity 'Event set_but_not_used' highlights. Thanks to Jeff for bringing them to my attention. This commit was SVN r18606.	2008-06-06 14:38:41 +00:00
Ralph Castain	332e6c89ab	Modify the slurm launcher so that the kill-on-bad-exit behavior is not "on" by default. Instead, only turn it "on" if the plm_slurm_detect_failure mca param is set to something non-zero This commit was SVN r18588.	2008-06-04 23:59:53 +00:00
George Bosilca	25ae9c12e6	Silence few warnings. This commit was SVN r18568.	2008-06-03 19:58:40 +00:00
George Bosilca	fa89d299bf	Silence the Obj-C compiler. This commit was SVN r18567.	2008-06-03 19:24:17 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Ralph Castain	95578b0528	Fix single-node operations so that the HNP correctly exits when the job completes This commit was SVN r18556.	2008-06-03 14:23:04 +00:00
Ralph Castain	b456fb2d42	Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. Heart rate is set via mca param. The HNP checks for callback every 2*heartrate, declares orted failure if not seen in last 2*heartrate time. Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher. Neither detection is enabled by default, but are only active if heartrate is set and/or launch timeout is set. Exception for SLURM as orted failure is always detected and reported. More info to come on devel list. This commit was SVN r18555.	2008-06-02 21:46:34 +00:00
Shiqing Fan	af656b2b3d	Fix some typing mistakes, make the sources compile again for Windows Visual Studio. This commit was SVN r18542.	2008-05-29 15:27:43 +00:00
Ralph Castain	2b28bef15a	Provide a "nicer" indication that we don't know the pid of the failed orted This commit was SVN r18538.	2008-05-29 14:10:58 +00:00
Ralph Castain	72530f8fed	Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and does a best-effort attempt to cleanup the mess. However, it still has two unresolved issues that need to be eventually addressed: 1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done. 2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear. This represents the best-available fix for ticket #221 so I am closing that ticket at this time. This commit was SVN r18536.	2008-05-29 13:38:27 +00:00
Ralph Castain	52fb773c6c	Tell slurm to kill the job if an orted abnormally exits This commit was SVN r18535.	2008-05-29 12:26:58 +00:00
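
The heartbeat rule described in commit b456fb2d42 (r18555) can be illustrated with a small sketch: a daemon is treated as failed if nothing has been heard from it within the last 2*heartrate. The struct, the function names, and the use of plain seconds are hypothetical and chosen for illustration; this is not the ORTE implementation, and treating a non-positive heartrate as "detection disabled" is an assumption based on the message's remark that detection is only active if a heartrate is set.

```c
#include <stdbool.h>

/* Hypothetical per-daemon record; not an ORTE data structure. */
typedef struct {
    long last_beat;   /* time (seconds) the most recent heartbeat arrived */
} daemon_status_t;

/* Interval between checks, and also the allowed silence window:
 * the commit message gives both as 2 * heartrate. */
long check_interval(long heartrate)
{
    return 2 * heartrate;
}

/* Returns true if the daemon should be declared failed at time `now`.
 * Assumption: heartrate <= 0 means heartbeat detection is not enabled. */
bool daemon_has_failed(const daemon_status_t *d, long now, long heartrate)
{
    if (heartrate <= 0) {
        return false;
    }
    return (now - d->last_beat) > check_interval(heartrate);
}
```

With a heartrate of 5 seconds and a last beat at t=100, the daemon is still considered alive at t=110 and is declared failed from t=111 onward.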


105 Commits
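
The launch-agent problem described in commit a62b2a0150 (r19097) can be sketched in a few lines of C: a multi-word agent such as "valgrind orted" is held as an argv array, the prefix is prepended only to the entry that names the daemon, and a range of entries is then joined back into one string. Both functions below are hypothetical stand-ins written for this illustration; `join_range` is not the real `opal_argv_join_range`, and the `prefix/entry` layout simply mirrors the "foo/..." form used in the commit message. Which argv entry is the "correct" one is passed in by the caller here, which is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Join argv[start] .. argv[end-1] with `delim` between entries.
 * Returns a malloc'd string the caller must free, or NULL on an
 * empty/invalid range or allocation failure. Hypothetical helper. */
char *join_range(char **argv, size_t start, size_t end, char delim)
{
    if (argv == NULL || start >= end) {
        return NULL;
    }
    size_t len = 0;
    for (size_t i = start; i < end; i++) {
        len += strlen(argv[i]) + 1;   /* +1 for delimiter or final NUL */
    }
    char *out = malloc(len);
    if (out == NULL) {
        return NULL;
    }
    char *p = out;
    for (size_t i = start; i < end; i++) {
        size_t n = strlen(argv[i]);
        memcpy(p, argv[i], n);
        p += n;
        *p++ = (i + 1 < end) ? delim : '\0';
    }
    return out;
}

/* Build the remote command with `prefix` applied only to argv[agent_idx],
 * e.g. {"valgrind", "orted"} with prefix "foo" and agent_idx 1 gives
 * "valgrind foo/orted" rather than "foo/valgrind orted". */
char *build_remote_cmd(char **argv, size_t argc, size_t agent_idx,
                       const char *prefix)
{
    if (argv == NULL || prefix == NULL || agent_idx >= argc) {
        return NULL;
    }
    size_t n = strlen(prefix) + 1 + strlen(argv[agent_idx]) + 1;
    char *prefixed = malloc(n);
    if (prefixed == NULL) {
        return NULL;
    }
    snprintf(prefixed, n, "%s/%s", prefix, argv[agent_idx]);

    /* Temporarily swap in the prefixed entry, join, then restore. */
    char *saved = argv[agent_idx];
    argv[agent_idx] = prefixed;
    char *cmd = join_range(argv, 0, argc, ' ');
    argv[agent_idx] = saved;

    free(prefixed);
    return cmd;
}
```

Keeping the agent as an argv array until the last moment is what makes it possible to target a single entry with the prefix; once the words are flattened into one string, the position of the daemon name is lost.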