openmpi

Автор	SHA1	Сообщение	Дата
Ralph Castain	61c21d787d	Add missing param in tm launcher This commit was SVN r20087.	2008-12-09 13:31:33 +00:00
Ralph Castain	ce4018efeb	Take a step back on the slurm and tm launchers. Problems were occurring in the MTT runs, although not under non-MTT scenarios. Preserve the modified plm versions in new components that are ompi_ignored until we can resolve the problems. This will allow for better MTT coverage until the problem can be better understood. This commit was SVN r20083.	2008-12-09 00:32:04 +00:00
Ralph Castain	a07660aea8	Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes: 1. coordination of job completion notification to include a requirement for both waitpid detection AND notification that all iof pipes have been closed by the app 2. change of all IOF read and write events to be non-persistent so they can properly be shutdown and restarted only when required 3. addition of a delay (currently set to 10ms) before restarting the stdin read event. This was required to ensure that the stdout, stderr, and stddiag read events had an opportunity to be serviced in scenarios where large files are attached to stdin. This commit was SVN r20064.	2008-12-03 17:45:42 +00:00
Ralph Castain	7213c109ac	Revamp the TM plm module so that we detect orted termination without requiring a callback message by using the TM native capabilities. This allows TM to function with fully routed OOB comm, and to tell us what node failed to spawn a daemon. This commit was SVN r20027.	2008-11-20 18:57:35 +00:00
Ralph Castain	89559396ea	Resolve a race condition when running under a SLURM environment. The slurm plm fork/exec's a call to srun to launch its daemons. When mpirun terminates, it then sends out a "terminate" command to those daemons. The daemons respond back to mpirun, and then exit. If slurm itself is running on a slow network, and mpirun is running the OOB across a fast network, then it is possible for mpirun to receive notification of daemon termination and exit -before- the srun can complete its bookkeeping and declare the job as complete. When this happens, slurm becomes confused and loses state. Mucho bad. :-/ This commit changes the termination logic so that mpirun will wait for srun to report complete before exiting. It also enables fully routed communications since it no longer requires daemons to report back that they are terminating, thus allowing the daemons to terminate asynchronously (thereby breaking routing paths). This commit was SVN r20018.	2008-11-18 13:59:23 +00:00
Jeff Squyres	f4ba25cf3c	Remove linking components against ORTE and OPAL libs. This was removed from all other components long ago; I'm not sure how these survived. This commit was SVN r19956.	2008-11-08 00:56:57 +00:00
Ralph Castain	58fe779388	Remove double destruct to fix segv when ctrl-c is used to terminate job This commit was SVN r19875.	2008-11-02 02:25:20 +00:00
George Bosilca	9f17d1d67d	Allow xgrid to compile with the changes from 19866. This commit was SVN r19868.	2008-10-31 21:56:53 +00:00
Ralph Castain	f54fda489e	This is a first step towards supporting fully-routed OOB communications: 1. remove direct routed module (hooray!) 2. add radix tree routed module (binomial remains default) 3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess 4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support 5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability 6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon) Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm. This will require an autogen... This commit was SVN r19866.	2008-10-31 21:10:00 +00:00
Ralph Castain	30b3bc6761	Minor update - provide one more helpful hint regarding stdin target out-of-range, ensure we exit cleanly since daemons won't have been launched. This commit was SVN r19847.	2008-10-29 16:00:48 +00:00
Ralph Castain	82ece176d5	Sanity check needs to allow vpid_invalid as this indicates the "none" scenario This commit was SVN r19820.	2008-10-28 14:50:26 +00:00
Ralph Castain	71dcf61f9b	Add sanity check to ensure that specified stdin target is within range of job. Print error message and exit if not. Modify read_write test to allow specification of rank to read stdin. IOF now validated to work for arbitrary rank as stdin target. Not validate to work for multiple simultaneous ranks reading stdin (untested). This commit was SVN r19804.	2008-10-25 14:38:06 +00:00
George Bosilca	61317cb61d	Complete the r19767 commit for XGrid, i.e. allow the PLM Xgrid to build. This commit was SVN r19777. The following SVN revision numbers were found above: r19767 --> open-mpi/ompi@6e5d844c36	2008-10-21 15:37:22 +00:00
Jeff Squyres	6d026b86b7	Fix a problem reported on the user list by Teng Lin: OPAL_PREFIX wasn't exported in the Bourne-shell-flavor case on remote nodes. This commit was SVN r19770.	2008-10-18 12:13:10 +00:00
Ralph Castain	6e5d844c36	Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code: 1. completely and cleanly separates responsibilities between the HNP, orted, and tool components. 2. removes all wireup messaging during launch and shutdown. 3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol. 4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0. 5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none". 6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout. 7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output" This is not intended for the 1.3 release as it is a major change requiring considerable soak time. This commit was SVN r19767.	2008-10-18 00:00:49 +00:00
Jeff Squyres	e34c93c46a	Fix problem of missing ) noted by Mostyn Lewis. This commit was SVN r19758.	2008-10-17 16:03:17 +00:00
Ralph Castain	48c3de1865	Fix a problem in the plm "failed to start" code observed by Jeff. When we are unable to launch to a specific node because it doesn't exist or is down, the system would hang and/or segv. The reason for the hang was that we were "firing" the orted exit trigger prior to its timer event being defined - thus "locking" that one-shot and preventing it from firing when we actually were ready to use it. The segv was caused by the fact that we don't really know which daemon failed to start (at least, in most cases), so we didn't set a pointer to the aborted proc object. All we really wanted, though, was to ensure that mpirun returned a non-zero exit status, so the fix was to simply return the default error status. This commit was SVN r19754.	2008-10-16 14:21:37 +00:00
Ralph Castain	a7afa869af	Bring Jeff's changes over from v1.2 that restores the automatic source of .profile for bash and ksh shells. This commit was SVN r19709.	2008-10-08 14:21:42 +00:00
Ralph Castain	f4f81c7308	Let the HNP only update the routing tree if necessary. Enable some debug output This commit was SVN r19676.	2008-10-03 13:41:08 +00:00
Ralph Castain	037231fbcb	MOdify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required. This commit was SVN r19637.	2008-09-25 13:39:08 +00:00
Ralph Castain	16e4b0b698	Ensure that a child job inherits its parent job's prefix dir during comm_spawn operations This commit was SVN r19538.	2008-09-10 19:05:23 +00:00
Ralph Castain	f326ee356e	Add some error output to the plm rsh This commit was SVN r19532.	2008-09-10 01:59:49 +00:00
Ralph Castain	9b8473fdbf	Cleanup orted cmd line - we don't need to pass nodenames, and shouldn't pass heartbeat unless the orted is going to use it. This helps shorten the cmd line for future use. Cleanup when an orted actually opens the PLM. Unfortunately, some unmentionable people are pushing head node environs out to remote nodes, causing the daemons to think they are the HNP. This helps prevent the confusion. This commit was SVN r19518.	2008-09-08 15:45:11 +00:00
Shiqing Fan	cd6ff74d89	Update the ccp module: rename the get_cluster_message function for both ras/plm. use _umask instead of umask. add WIN32_DCOM definition to support Windows Vista. This commit was SVN r19470.	2008-09-01 16:35:38 +00:00
Ralph Castain	43f8bcfe54	Update slurm plm to respect leave_session_attached This commit was SVN r19370.	2008-08-19 18:30:30 +00:00
Ralph Castain	4e0f34a062	When we hit an error prior to actually launching daemons, it would be nice if orterun didn't bark about daemons failing to launch, mpirun detecting a job failed, etc. Add a new job state to indicate that we never attempted to launch. Flag such a scenario and avoid hitting all the other error messages. This commit was SVN r19366.	2008-08-19 15:19:30 +00:00
Ralph Castain	49745c5f40	Provide a new option that allows a user to leave an ssh session open without getting deluged by ORTE debug output. The new option is --leave-session-attached, with a corresponding MCA param of orte_leave_session_attached. Theoretically, any PLM could use this - but in reality, all of them except rsh/ssh already leave the session attached anyway. This fixes trac:656 - a REALLY old ticket This commit was SVN r19294. The following Trac tickets were found above: Ticket 656 --> https://svn.open-mpi.org/trac/ompi/ticket/656	2008-08-14 18:59:01 +00:00
Ralph Castain	30f37f762d	Enable co-location of debugger daemons during initial launch and when debugging a running job. Provide support for four MPIR extensions that allow specification of debugger daemon executable, argv for the debugger daemon, whether or not to forward debugger daemon IO, and whether or not debugger daemon will piggy-back on ORTE OOB network. Last is not yet implemented. No change in behavior or operation occurs unless (a) the debugger specifically utilizes the extensions and, for co-locate while running, the user specifically enables the capability via an MCA param. Two of the MPIR extensions supported here are used in a widely-used debugger for a large-scale installation. The other two extensions are new and being utilized in prototype work by several debuggers for possible future release. This commit was SVN r19275.	2008-08-13 17:47:24 +00:00
Ralph Castain	f017c55bfa	Close a minor memory leak - we can reuse timer events This commit was SVN r19251.	2008-08-12 12:53:30 +00:00
Ralph Castain	be02211b4f	Modify the wakeup system to make it more Windows-friendly. This allows Shiqing to consolidate the Windows-specific modifications into one location, and generalizes the wakeup procedure in case we hit other system-specific requirements. This needs some soak time to ensure we haven't opened any race conditions. I tried to loop everything in the shutdown procedure through that trigger event call to ensure it all goes through the one-time locks as it did before so that someone hitting ctrl-c when we are already shutting down shouldn't cause problems. Just want to let people use it for awhile to verify. This commit was SVN r19159.	2008-08-05 15:09:29 +00:00
Ralph Castain	5b2f53a069	One more quick fix - ensure we are looking at the value and not its pointer This commit was SVN r19123.	2008-08-01 23:39:55 +00:00
Jeff Squyres	26c7daf16a	Fix typo This commit was SVN r19121.	2008-08-01 21:30:53 +00:00
Ralph Castain	21cd4b9df8	Add pls_rsh_agent synonym to the PLM rsh component This commit was SVN r19119.	2008-08-01 20:15:42 +00:00
Ralph Castain	a62b2a0150	Per the July technical meeting: Standardize the handling of the orte launch agent option across PLMs. This has been a consistent complaint I have received - each PLM would register its own MCA param to get input on the launch agent for remote nodes (in fact, one or two didn't, but most did). This would then get handled in various and contradictory ways. Some PLMs would accept only a one-word input. Others accepted multi-word args such as "valgrind orted", but then some would error by putting any prefix specified on the cmd line in front of the incorrect argument. For example, while using the rsh launcher, if you specified "valgrind orted" as your launch agent and had "--prefix foo" on you cmd line, you would attempt to execute "ssh foo/valgrind orted" - which obviously wouldn't work. This was all -very- confusing to users, who had to know which PLM was being used so they could even set the right mca param in the first place! And since we don't warn about non-recognized or non-used mca params, half of the time they would wind up not doing what they thought they were telling us to do. To solve this problem, we did the following: 1. removed all mca params from the individual plms for the launch agent 2. added a new mca param "orte_launch_agent" for this purpose. To further simplify for users, this comes with a new cmd line option "--launch-agent" that can take a multi-word string argument. The value of the param defaults to "orted". 3. added a PLM base function that processes the orte_launch_agent value and adds the contents to a provided argv array. This can subsequently be harvested at-will to handle multi-word values 4. modified the PLMs to use this new function. All the PLMs except for the rsh PLM required very minor change - just called the function and moved on. The rsh PLM required much larger changes as - because of the rsh/ssh cmd line limitations - we had to correctly prepend any provided prefix to the correct argv entry. 5. added a new opal_argv_join_range function that allows the caller to "join" argv entries between two specified indices Please let me know of any problems. I tried to make this as clean as possible, but cannot compile all PLMs to ensure all is correct. This commit was SVN r19097.	2008-07-30 18:26:24 +00:00
George Bosilca	a4d905db4a	Allow xgrid to compile. This commit was SVN r19076.	2008-07-29 13:24:08 +00:00
Jeff Squyres	0af7ac53f2	Fixes trac:1392, #1400 * add "register" function to mca_base_component_t * converted coll:basic and paffinity:linux and paffinity:solaris to use this function * we'll convert the rest over time (I'll file a ticket once all this is committed) * add 32 bytes of "reserved" space to the end of mca_base_component_t and mca_base_component_data_2_0_0_t to make future upgrades [slightly] easier * new mca_base_component_t size: 196 bytes * new mca_base_component_data_2_0_0_t size: 36 bytes * MCA base version bumped to v2.0 * '''We now refuse to load components that are not MCA v2.0.x''' * all MCA frameworks versions bumped to v2.0 * be a little more explicit about version numbers in the MCA base * add big comment in mca.h about versioning philosophy This commit was SVN r19073. The following Trac tickets were found above: Ticket 1392 --> https://svn.open-mpi.org/trac/ompi/ticket/1392	2008-07-28 22:40:57 +00:00
Thomas Herault	28dc80b67e	Deal with the SIGCHLD issue in LSF. lsb_launch tampers with SIGCHLD signal handler. We are forced to reinstall our own signal handler after a call to this function. This commit fixes trac:1356. This commit was SVN r19033. The following Trac tickets were found above: Ticket 1356 --> https://svn.open-mpi.org/trac/ompi/ticket/1356	2008-07-25 15:23:23 +00:00
Thomas Herault	b6affd35e9	Small typos for LSF compilation and update Makefile.am This commit was SVN r18998.	2008-07-23 14:42:26 +00:00
George Bosilca	bcac9a0540	Remove a warning about using map when it is not initialized. This commit was SVN r18957.	2008-07-21 14:35:05 +00:00
Ralph Castain	b1f367563c	Few minor code cleanups This commit was SVN r18890.	2008-07-11 15:40:41 +00:00
Ralph Castain	58964b2bf8	Make lsf support compile This commit was SVN r18889.	2008-07-11 15:40:25 +00:00
Shiqing Fan	67a842fd17	Too many actual parameters for this function, remove the wrong one in order to get rid of the compiler warnings. This commit was SVN r18863.	2008-07-10 08:06:53 +00:00
Pak Lui	924bface15	The plm env var should set to the name of a current plm module, which is rsh. This commit was SVN r18844.	2008-07-08 23:15:52 +00:00
Shiqing Fan	5d0f4dc88d	- Clean up the unreferenced variables. - Change the arguments for launch failed function according to changeset r18611. This commit was SVN r18795. The following SVN revision numbers were found above: r18611 --> open-mpi/ompi@7bee71aa59	2008-07-03 10:11:08 +00:00
Shiqing Fan	a3e1718126	Missing one argument for calling this function. This commit was SVN r18793.	2008-07-01 18:01:22 +00:00
Ralph Castain	9cebe0ca96	Ckpt the bproc support. All compiles now except for PLM module This commit was SVN r18744.	2008-06-26 03:48:22 +00:00
Ralph Castain	17fcd72b5d	Restore bproc code - if someone wants to maintain it, then more power to them...but it would definitely be easier if the old code is in the trunk. This is all .ompi_ignore'd except for me so I can play with making it compile again in my copious free time. This commit was SVN r18716.	2008-06-24 01:27:22 +00:00
Ralph Castain	3e61a3f92e	Sandbox for next-gen launch This commit was SVN r18715.	2008-06-24 01:25:51 +00:00
Ralph Castain	f799ea225f	Orterun creates a "clean" copy of its environment for use in launching procs. This includes properly setting LD_LIBRARY_PATH and PATH, among other things. Unfortunately, our PLM modules were using the local environ instead of the saved copy, thus missing a number of things that really should have been included. From what I see, we got away with the error because the PLM's were duplicating all that setup logic themselves - I'll clean this up over the next few days. Meantime, correct the PLM's so they use the correct environ for launching. This commit was SVN r18713.	2008-06-23 22:39:36 +00:00
Ralph Castain	b56f8ced4f	Ensure params are registered prior to parsing global cmd line options in orterun so that debugger options are properly captured and acted upon. Ensure that routes to remote procs are set on the HNP before completing launch so that the debugger message can be sent. Solves a race condition that can exist in those environments where the HNP does not have local procs. This commit was SVN r18674.	2008-06-19 02:58:14 +00:00
Ralph Castain	282a220e7e	Update the debugger interface per email thread with Jeff and Brian. Handoff to them for final test and validation This commit was SVN r18670.	2008-06-18 15:28:46 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Ralph Castain	1a422995ae	Fix two Coverity complaints CID 813 (value defined and not used) and 1039 (resource leak). While doing so, found and fixed another less obvious memory leak. This commit was SVN r18641.	2008-06-10 17:53:28 +00:00
George Bosilca	f72ab90b16	Allow xgrid to compile again. This commit was SVN r18631.	2008-06-09 21:51:41 +00:00
Ralph Castain	c13cadc3c7	Refs trac:1255 This commit repairs the debugger initialization procedure. I am not closing the ticket, however, pending Jeff's review of how it interfaces to the ompi_debugger code he implemented. There were duplicate symbols being created in that code, but not used anywhere. I replaced them with the ORTE-created symbols instead. However, since they aren't used anywhere, I have no way of checking to ensure I didn't break something. So the ticket can be checked by Jeff when he returns from vacation... :-) This commit was SVN r18625. The following Trac tickets were found above: Ticket 1255 --> https://svn.open-mpi.org/trac/ompi/ticket/1255	2008-06-09 20:34:14 +00:00
Ralph Castain	bf5c34d10a	The rsh launcher is one place where multi-word MCA params would have to be passed via the orted cmd line. In such a case, we have to explicitly include quote marks about the param value. Add that capability here. This commit fixes trac:1200 This commit was SVN r18621. The following Trac tickets were found above: Ticket 1200 --> https://svn.open-mpi.org/trac/ompi/ticket/1200	2008-06-09 19:07:19 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Pak Lui	caac0e0182	Add in a couple missing ones from r18611 for all tm users out there... This commit was SVN r18615. The following SVN revision numbers were found above: r18611 --> open-mpi/ompi@7bee71aa59	2008-06-06 22:53:43 +00:00
Ralph Castain	b65eb54ea2	Cut out a new iof pull - that capability isn't ready yet for the trunk, but will be coming shortly Thanks to Pak for letting me know... This commit was SVN r18614.	2008-06-06 21:24:15 +00:00
Ralph Castain	7bee71aa59	Fix a potential, albeit perhaps esoteric, race condition that can occur for fast HNP's, slow orteds, and fast apps. Under those conditions, it is possible for the orted to be caught in its original send of contact info back to the HNP, and thus for the progress stack never to recover back to a high level. In those circumstances, the orted can "hang" when trying to exit. Add a new function to opal_progress that tells us our recursion depth to support that solution. Yes, I know this sounds picky, but good ol' Jeff managed to make it happen by driving his cluster near to death... Also ensure that we declare "failed" for the daemon job when daemons fail instead of the application job. This is important so that orte knows that it cannot use xcast to tell daemons to "exit", nor should it expect all daemons to respond. Otherwise, it is possible to hang. After lots of testing, decide to default (again) to slurm detecting failed orteds. This proved necessary to avoid rather annoying hangs that were difficult to recover from. There are conditions where slurm will fail to launch all daemons (slurm folks are working on it), and yet again, good ol' Jeff managed to find both of them. Thanks you Jeff! :-/ This commit was SVN r18611.	2008-06-06 19:36:27 +00:00
Josh Hursey	1de50b523c	Fix some Coverity 'Event set_but_not_used' highlights. Thanks to Jeff for bringing them to my attention. This commit was SVN r18606.	2008-06-06 14:38:41 +00:00
Ralph Castain	332e6c89ab	Modify the slurm launcher so that the kill-on-bad-exit behavior is not "on" by default. Instead, only turn it "on" if the plm_slurm_detect_failure mca param is set to something non-zero This commit was SVN r18588.	2008-06-04 23:59:53 +00:00
George Bosilca	25ae9c12e6	Silence few warnings. This commit was SVN r18568.	2008-06-03 19:58:40 +00:00
George Bosilca	fa89d299bf	Silence the Obj-C compiler. This commit was SVN r18567.	2008-06-03 19:24:17 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Ralph Castain	95578b0528	Fix single-node operations so that the HNP correctly exits when the job completes This commit was SVN r18556.	2008-06-03 14:23:04 +00:00
Ralph Castain	b456fb2d42	Upgrade the node/orted failure detection code to cover all environments. Use the native environment's capabilities where possible - e.g., SLURM detects orted failure and can report it. Elsewhere, use a heartbeat system to detect orted failure - e.g., for TM and rsh. Heart rate is set via mca param. The HNP checks for callback every 2heartrate, declares orted failure if not seen in last 2heartrate time. Also detect orted failed-to-start by setting timeout on launch. Currently only used in TM launcher. Neither detection is enabled by default, but are only active if heartrate is set and/or launch timeout is set. Exception for SLURM as orted failure is always detected and reported. More info to come on devel list. This commit was SVN r18555.	2008-06-02 21:46:34 +00:00
Shiqing Fan	af656b2b3d	Fix some typing mistakes, make the sources compile again for Windows Visual Studio. This commit was SVN r18542.	2008-05-29 15:27:43 +00:00
Ralph Castain	2b28bef15a	Provide a "nicer" indication that we don't know the pid of the failed orted This commit was SVN r18538.	2008-05-29 14:10:58 +00:00
Ralph Castain	72530f8fed	Cleanly handle the failed start of an orted, or its unexpected failure after start. This commit will allow mpirun to exit cleanly when this occurs, and does a best-effort attempt to cleanup the mess. However, it still has two unresolved issues that need to be eventually addressed: 1. it depends upon the ability of the native environment to alert us that the orted has died/failed to start. I have included that support for SLURM, but other environments need to be done. 2. for some yet-to-be-determined reason, the message that tells the remaining daemons to "die" isn't getting out of the RML, even though no obvious blockage is standing in the way. Work will continue on resolving that problem. For now, the orteds appear to be exiting on their own quite nicely when they see their HNP "lifeline" disappear. This represents the best-available fix for ticket #221 so I am closing that ticket at this time. This commit was SVN r18536.	2008-05-29 13:38:27 +00:00
Ralph Castain	52fb773c6c	Tell slurm to kill the job if an orted abnormally exits This commit was SVN r18535.	2008-05-29 12:26:58 +00:00
Ralph Castain	f76240e7cc	Modify the nidmap utility to pass daemon vpids for nodes. In some mapping algo's, it is possible for nodes to be skipped. This results in daemon vpids that differ from the index of their respective node in the node array, causing the daemon to not recognize procs that it is supposed to launch. This commit was SVN r18528.	2008-05-28 18:38:47 +00:00
Pak Lui	695c158192	silence some intel and pgcc compiler warnings. This commit was SVN r18501.	2008-05-26 20:35:13 +00:00
Pak Lui	7b3d7dcac4	This commit closes trac:1300. This commit was SVN r18473. The following Trac tickets were found above: Ticket 1300 --> https://svn.open-mpi.org/trac/ompi/ticket/1300	2008-05-21 22:35:04 +00:00
Jeff Squyres	671f0c379d	Remove a whole pile of orte/util/show_help.h's that I missed. :-( This commit was SVN r18437.	2008-05-14 11:32:33 +00:00
Pak Lui	4c8d79d907	Silence the compiler warnings/errors. There is no orte/util/show_help.h This commit was SVN r18436.	2008-05-13 22:07:38 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already make the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not thei opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). * Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so you if explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent view the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing to messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. = New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Shiqing Fan	7ff440f628	Add quotation marks for windows path. This commit was SVN r18420.	2008-05-09 14:12:09 +00:00
George Bosilca	dbea3e070e	Correct some copy/paste errors. This commit was SVN r18396.	2008-05-07 04:04:42 +00:00
Pak Lui	0302c098be	minor typo This commit was SVN r18386.	2008-05-06 21:26:17 +00:00
Ralph Castain	d97a4f880d	Shift the daemon collective operation to the ODLS framework. Ensure we track the collectives per job to avoid race conditions. Take advantage of the new capabilities of the routed framework to define aggregating trees for the daemon collective, and to track which daemons are participating to handle the case of sparse participation. Make it all work with comm_spawn in the case of all procs on previously occupied nodes, some new procs on new nodes, and mixtures of the two. Note: comm_spawn now works with both binomial and linear routed modules. There remains a problem of spawned procs not properly getting updated contact info for the parent proc when run in the direct routed mode...but that's for another day. This commit was SVN r18385.	2008-05-06 20:16:17 +00:00
Josh Hursey	c47406810e	Fix AMCA orted command line. If no AMCA parameters are passed then do not send across the path information. Only place it on the command line if the AMCA parameter is set. This commit was SVN r18382.	2008-05-06 18:27:31 +00:00
Josh Hursey	9971bc9d95	Merge in the mca_base_select changes per RFC: http://www.open-mpi.org/community/lists/devel/2008/04/3779.php {{{ svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play . }}} Any components not in the trunk, but in one of the effected frameworks must be updated. Contact the list, look at the RFC, or look at the diff for how to do this. Sorry for the early commit of this, but I wanted to get it in today (per RFC) and didn't know if I would have a chance later today. This commit was SVN r18381.	2008-05-06 18:08:45 +00:00
Ralph Castain	40904dd152	Add a binomial routed module - for now, still completely wires up the daemons, but that will be changed later. Modify grpcomm xcast so it now uses the selected routed module - eliminates cross-wiring of xcast and routing paths. Suboptimal at the moment, but better implementation is on its way. Cleanup ignore properties on the new routed components. This commit was SVN r18377.	2008-05-05 22:32:25 +00:00
Ralph Castain	b2c73f6e11	Fix tree-spawn to work within the new modex system This commit was SVN r18349.	2008-05-01 19:19:34 +00:00
Ralph Castain	ad894b050b	Set the bookmark so the first process of a comm_spawn'd job will be mapped to the same node as the spawning proc, assuming it has space. If not, then the mapper will automatically move to the next node. This commit was SVN r18346.	2008-05-01 15:24:03 +00:00
Ralph Castain	1766442591	Fix a double-free when tree-spawning Fix the round-robin mapper so it doesn't move to the next node just because it completed mapping an app_context This commit was SVN r18344.	2008-05-01 14:49:56 +00:00
Ralph Castain	3e55fe6f6d	Fold in the revised modex scheme. Move the ompi_proc_t modex portions to the RTE level since the daemons already have that info. Provide each process with the equivalent of a "nidmap" - both a map of what nodes are in the job, and a map of which node each process is on. This enables the use of static ports, though that hasn't been turned "on" in this commit. Update the rsh tree spawn capability so we spawn the next wave of daemons before launching our own local procs. Add an ability to encode nodenames for large clusters with contiguous node name numbering schemes - this allows communication of all node names in a few bytes instead of tens-of-bytes/node. This commit was SVN r18338.	2008-04-30 19:49:53 +00:00
Ralph Castain	4c2c6c9bd8	Ensure the pack/unpacks match for tree-spawn This commit was SVN r18282.	2008-04-24 18:53:08 +00:00
Ralph Castain	09b6758f8c	Pass the prefix dir to the remote orted when doing tree-based spawns This commit was SVN r18280.	2008-04-24 18:38:24 +00:00
Shiqing Fan	eb5f5d77cc	If it's not the HNP, release the cluster object first and return. This commit was SVN r18247.	2008-04-23 13:21:32 +00:00
Ralph Castain	e7487ad533	Implement the seq rmaps module that sequentially maps process ranks to a list hosts in a hostfile. Restore the "do-not-launch" functionality so users can test a mapping without launching it. Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests. Add a function to hostfile to generate an ordered list of host names from a hostfile This commit was SVN r18190.	2008-04-17 13:50:59 +00:00
Ralph Castain	a4ea756a76	Ensure the node loop cntr gets incremented if the daemon already exists This commit was SVN r18150.	2008-04-15 14:20:03 +00:00
Ralph Castain	35c260a14f	Fix the plm modules to accommodate the new remote_spawn entry - set that entry to NULL for all but rsh as only that module supports it at this time This commit was SVN r18145.	2008-04-14 19:36:13 +00:00
Ralph Castain	7c7304466c	Add a binomial tree-based launch to ssh, turned "on" only when the plm_rsh_tree_spawned mca param is set to a non-zero value. This probably isn't a very optimized capability, but it does execute a tree-based launch that may scale better than linear at high node counts. Add the daemon map capability to the ODLS to create and save a map of daemon vpid vs nodename from the launch message. Cleanup a few places in the base plm launch support where we didn't adequately protect rml recv's from potentially executing sends. This commit was SVN r18143.	2008-04-14 18:26:08 +00:00
Ralph Castain	851279fc9f	Consolidate the daemon wireup message into the launch message. The daemons don't need their contact info prior to the launch message anyway. This not only eliminates a job-wide communication from the startup procedure, but it also resolves a race condition reported when operating across highly distributed (i.e., cross-country) networks. In such scenarios, it proved possible for a daemon to receive its launch message -before- it had received the contact info message, even though the latter had been sent first! This eliminates that problem... This commit was SVN r18126.	2008-04-10 15:35:11 +00:00
Ralph Castain	57e3e86cda	Use the proper exit code for mpirun to indicate an error when something goes wrong during launch (in scenarios where the procs don't report the problem directly themselves) This commit was SVN r18121.	2008-04-10 09:15:08 +00:00
Ralph Castain	22343e6e0b	Given total lack of interest/support from the folks behind these environments, and the fact that we can now scale so well with our own daemons, it seems unlikely that we will be able to pursue direct and/or standalone launch in these environments. If that situation ever changes, it is easy enough to revive the effort since little had really been done to-date. Meantime, no reason to continue dragging these around. This commit was SVN r18119.	2008-04-10 02:54:13 +00:00
Ralph Castain	3a0d09300b	Fully implement the inbound binomial allgather for daemon-based collectives. Supports both modex and barrier operations. Comm_spawn still uses the rank=0 method - shifting that algo to the daemons is under study. This commit was SVN r18115.	2008-04-09 22:10:53 +00:00
Ralph Castain	f3936ff9bc	Record the daemon's state so that we don't attempt to send "die" messages to a daemon that is known to have failed to start. This commit was SVN r18044.	2008-03-31 18:15:24 +00:00

1 2 3 4

176 Коммитов