openmpi

Author	SHA1	Message	Date
Ralph Castain	4efddc7b0a	Fix the allgather and allgather_list functions to avoid deadlocks at large node/proc counts. Violated the RML rules here - we received the allgather buffer and then did an xcast, which causes a send to go out, and is then subsequently received by the sender. This fix breaks that pattern by forcing the recv to complete outside of the function itself - thus, the allgather and allgather_list always complete their recvs before returning or sending. Reorganize the grpcomm code a little to provide support for soon-to-come new grpcomm components. The revised organization puts what will be common code elements in the base to avoid duplication, while allowing components that don't need those functions to ignore them. This commit was SVN r17941.	2008-03-24 20:50:31 +00:00
Ralph Castain	58d51f2689	Revert that! Need to complete the rest of the change so the orted knows the correct nodeid... Sorry This commit was SVN r17939.	2008-03-24 18:17:26 +00:00
Ralph Castain	dae4518878	Use the correct nodeid! This commit was SVN r17938.	2008-03-24 18:15:08 +00:00
Ralph Castain	dc7f45dafd	Remove the obsolete and largely unused orte_system_info structure. The only fields that were used in that struct were nodeid and nodename - these have been transferred to the orte_process_info structure. Only one place used the user name field - session_dir, when formulating the name of the top-level directory. Accordingly, the code for getting the user's id has been moved to the session_dir code. This commit was SVN r17926.	2008-03-23 23:10:15 +00:00
Ralph Castain	f8642e9390	Add debug to tell us when we opened a socket and to whom This commit was SVN r17911.	2008-03-21 15:47:47 +00:00
Ralph Castain	19ffdfef42	Add some debugging output to tell us what interfaces were considered and used by OOB This commit was SVN r17909.	2008-03-21 15:35:40 +00:00
Ralph Castain	c2fd5dd416	Clarify method used to translate application proc termination codes to exit status codes This commit was SVN r17899.	2008-03-20 18:50:05 +00:00
Brian Barrett	2bf4784893	Set a meaningful orte_system_info.nodeid on Catamount This commit was SVN r17898.	2008-03-20 16:55:57 +00:00
Ralph Castain	f8a10dfb93	Complete the fix of the orted vs mpirun race condition for finalizing. The darned mpirun is just too fast! Rather than try to slow it down, we set the orte_finalizing flag -prior- to telling mpirun the orted is leaving. This ensures we don't mistakenly declare the lifeline lost when mpirun leaves in a hurry. This commit was SVN r17897.	2008-03-20 16:55:24 +00:00
Ralph Castain	6bb139e4f2	One more correction to mpirun exit codes - cleanup the application proc's exit codes in the orted so that non-zero exit codes generated by mpirun itself don't get "munged". Modify the multi_abort function so they all return different exit codes - allows us to tell which one was being reported. This commit was SVN r17895.	2008-03-20 13:54:11 +00:00
Ralph Castain	27a73ad9ee	Fix a race condition between the orteds and HNP that can cause the orteds to output the "lost lifeline" message. This has been a long-time problem. I tried to reduce the problem by having the orteds tell the HNP they were finalizing, and having the HNP wait until all orteds had reported or we timed out. What was observed was that all the orteds were correctly reporting that they are leaving, but the HNP is able to exit before the orteds, thus closing the orteds lifeline socket and generating the error output. This is caused by the fact that the orteds have to whack all remaining session directories, which includes that blasted monster shared memory file! Cleaning up the SM file can take quite a while. The HNP doesn't have that problem as there is no SM file there! So it gets out first. What we had done in the past to resolve that problem was put a little test in the OOB that checks to see if we are finalizing. If we are, then we ignore the lifeline connection being lost. That check was still in the code - however, we had lost the line in orte_finalize that set the flag!! This commit was SVN r17893.	2008-03-20 13:30:51 +00:00
Ralph Castain	8ee26a55ca	Just turn these off for now - will revisit later This commit was SVN r17891.	2008-03-20 13:25:35 +00:00
Ralph Castain	67a2cc8a8e	Fix a bug noted by Tim P where we would report the incorrect app_context as "not found". If you gave us the command line: mpirun -n 1 hostname : -n 1 bogus we would erroneously report that hostname had not been found instead of bogus. This commit was SVN r17886.	2008-03-19 21:13:13 +00:00
Ralph Castain	ec64bf3da8	Clarify the error output so we can understand if it was a daemon or process that lost its lifeline This commit was SVN r17880.	2008-03-19 19:06:52 +00:00
Ralph Castain	2ed0e60321	Bring some sanity to the exit code returned by mpirun. Ensure that we provide a non-zero code if something goes wrong, including someone exiting after calling mpi_init without calling mpi_finalize. Jeff is preparing an (undoubtedly lengthy) explanation/matrix of how these codes are determined for the OMPI FAQ. This commit was SVN r17879.	2008-03-19 19:00:51 +00:00
Galen Shipman	80ac7c87cd	don't forget command file.. This commit was SVN r17878.	2008-03-19 16:24:29 +00:00
Galen Shipman	77c8532cc9	do things in a less hacky way.. This commit was SVN r17877.	2008-03-19 16:23:56 +00:00
Jeff Squyres	ac2e329353	Oops! That should not have been removed... This commit was SVN r17865.	2008-03-18 14:42:30 +00:00
Jeff Squyres	bd92720d41	More fixes to make it compile and play nice on OS X. Still more fixes are required; sending mail to devel shortly... This commit was SVN r17864.	2008-03-18 14:38:52 +00:00
Ralph Castain	8f31a62600	Fix compilation errors so this will compile, remove unused variables This commit was SVN r17862.	2008-03-18 13:01:26 +00:00
Lenny Verkhovsky	647bce6d3e	Support for new RMAPS rank mapping component This commit was SVN r17860.	2008-03-18 09:39:07 +00:00
Lenny Verkhovsky	14c32f87d5	Added new RMAPS component for rank mapping This commit was SVN r17859.	2008-03-18 09:33:49 +00:00
Ralph Castain	8cd6142e6d	Add some debugging to the grpcomm module. Setting grpcomm_base_verbose = 1 will now give you a trace through the functions as they are called. Setting it to 2 or more will give you details on what each function is doing as it works through its procedure. This commit was SVN r17848.	2008-03-17 19:34:36 +00:00
Ralph Castain	629b95a2fe	Afraid this has a couple of things mixed into the commit. Couldn't be helped - had missed one commit prior to running out the door on vacation. Fix race conditions in abnormal terminations. We had done a first-cut at this in a prior commit. However, the window remained partially open due to the fact that the HNP has multiple paths leading to orte_finalize. Most of our frameworks don't care if they are finalized more than once, but one of them does, which meant we segfaulted if orte_finalize got called more than once. Besides, we really shouldn't be doing that anyway. So we now introduce a set of atomic locks that prevent us from multiply calling abort, attempting to call orte_finalize, etc. My initial tests indicate this is working cleanly, but since it is a race condition issue, more testing will have to be done before we know for sure that this problem has been licked. Also, some updates relevant to the tool comm library snuck in here. Since those also touched the orted code (as did the prior changes), I didn't want to attempt to separate them out - besides, they are coming in soon anyway. More on them later as that functionality approaches completion. This commit was SVN r17843.	2008-03-17 17:58:59 +00:00
Josh Hursey	aaff245271	A couple verbose additions. Poll the event engine while waiting for the named pipe. This commit was SVN r17787.	2008-03-07 21:10:14 +00:00
Galen Shipman	0fb6cf0916	make output use verbose macro.. This commit was SVN r17778.	2008-03-07 03:06:17 +00:00
Shiqing Fan	eb1dfaf4d5	Select the Windows CCP component at runtime by testing if we are on a Windows cluster. This commit was SVN r17776.	2008-03-07 01:31:53 +00:00
Ralph Castain	b110a247be	Fix comm_spawn (maybe). Comm_spawn was sticking during spawn_multiple because of a problem in the dpm - the modex there is asking processes to talk to each other in an allgather_list operation, but the procs don't have the required contact info to do so. The solution here was to ensure that all parent procs have full contact info for procs in the child job. Admittedly, this isn't the long-term answer. We would like to have the contact info given to only the parent procs that were involved in the comm_spawn. There is a way to do that, but this will suffice to keep things working until that can be implemented and tested. This commit was SVN r17772.	2008-03-06 21:56:00 +00:00
Ralph Castain	64d43cc44b	Fix the unity routed component and direct xcast mode. Ensure that direct xcast handles all its use-cases correctly. Unity routed component needs to use the base recv function to properly operate. This commit was SVN r17764.	2008-03-06 18:13:05 +00:00
Ralph Castain	ff99aa054f	In order to prevent orphaned processes when using non-unity routing methods, the procs need to realize that their local daemon is a critical connection - if that connection unexpectedly closes, they need to terminate. This commit adds definition for a "lifeline" connection. For an HNP, there is no lifeline, so the lifeline proc is NULL. For a daemon, the lifeline is the HNP - the daemon should abort if it loses that connection. For a proc using unity routed, the lifeline is the HNP since it connects directly to the HNP. For a proc using tree routed, the lifeline is the local daemon. Adjusted OOB to call abort if the lifeline (as opposed to HNP) connection is lost. This commit was SVN r17761.	2008-03-06 15:30:44 +00:00
Josh Hursey	0b4d9a12ce	a bit more verbosity for the fun of it This commit was SVN r17758.	2008-03-06 14:04:25 +00:00
Tim Prins	f61c2333c0	Remove unneeded field, and the two uses of it. This commit was SVN r17757.	2008-03-06 12:46:36 +00:00
Tim Prins	d56f19c77d	Fix logic error, and remove unneeded checks for invalid results. This commit was SVN r17756.	2008-03-06 04:38:13 +00:00
Ralph Castain	6d94e7b232	Fix the debug output so it correctly reports launch state This commit was SVN r17755.	2008-03-06 03:11:01 +00:00
Tim Prins	5de3e1965e	Remove the orte_proc_table. Migrate all users of it to the opal_hash_table and a new name hash function in orte. Everything should work, however I am unable to compile and test the sctp BTL. This commit was SVN r17751.	2008-03-05 22:44:35 +00:00
Tim Prins	f9916811ae	Make it so we do not mangle the options the user passes to their executable. Fixes trac:1124 The change also: - cleans up and simplifies the command line processing code - adds an error output if more than one hostfile passed for a single app context - gets rid of the superfluous orte_app_context_map_t type, and instead use a simple argv of -host options This commit was SVN r17750. The following Trac tickets were found above: Ticket 1124 --> https://svn.open-mpi.org/trac/ompi/ticket/1124	2008-03-05 22:12:27 +00:00
Rolf vandeVaart	03fdd57d5a	Fix the use of --path and -x PATH so that things work properly. Note that --path specifies extra directories where the executable is searched for, but does not affect the PATH settings. This commit fixes trac:1221. This commit was SVN r17748. The following Trac tickets were found above: Ticket 1221 --> https://svn.open-mpi.org/trac/ompi/ticket/1221	2008-03-05 21:07:43 +00:00
Ralph Castain	4dbc352828	Per request, change name of new enviro var to OMPI_COMM_WORLD_LOCAL_SIZE This commit was SVN r17736.	2008-03-05 14:45:26 +00:00
Ralph Castain	06d3145fe4	First cut at direct launch for TM. Able to launch non-ORTE procs and detect their completion for a clean shutdown. This commit was SVN r17732.	2008-03-05 13:51:32 +00:00
George Bosilca	c71f225a28	These functions should only be compiled when OPAL_ENABLE_FT == 1. This commit was SVN r17727.	2008-03-05 05:57:13 +00:00
Josh Hursey	3b4073e32c	This commit fixes the checkpoint/restart functionality on the trunk. Included in this commit are: * Extension to the ESS framework to support C/R * Fixed support for {{{snapc_base_establish_global_snapshot_dir}}} * Fixed FileM support * Misc. minor code modifications There are some outstanding visibility issues that I want to fix next. This commit was SVN r17725.	2008-03-05 04:57:23 +00:00
Ralph Castain	edb8e32a7a	Add default hostfile parameter plus --default-hostfile command line option. Fix error message when job setup failed This commit was SVN r17724.	2008-03-05 04:54:57 +00:00
Ralph Castain	022fc1f382	Add another MPI-related enviro variable OMPI_COMM_WORLD_NUM_LOCAL_PROCS This commit was SVN r17723.	2008-03-05 04:53:32 +00:00
Ralph Castain	e745c16ff1	Modify the enviro variable names to be OMPI_... Add two new ones: OMPI_COMM_WORLD_LOCAL_RANK and OMPI_UNIVERSE_SIZE This commit was SVN r17694.	2008-03-04 20:16:05 +00:00
Shiqing Fan	ebf9c0441d	Set the windows components invisible. This commit was SVN r17687.	2008-03-04 17:37:17 +00:00
Shiqing Fan	ae41b5418b	Update the RAS and PLM components for Windows. These changes affect only Windows, not other platforms. This commit was SVN r17686.	2008-03-04 17:13:01 +00:00
Ralph Castain	ffa232687a	Fix xcast so it works in multi-node situations where the user specifies a particular mode to use (e.g., direct). This commit was SVN r17682.	2008-03-03 20:07:02 +00:00
Ralph Castain	841d0e5208	Cleanup an attribute warning - not sure which one to set or where it should go, so I'll leave that to someone more familiar with "attributes". Ensure some debugging is only enabled when have_debug is set. This commit was SVN r17681.	2008-03-03 16:06:47 +00:00
Rich Graham	d37db14901	get the shared memory collectives working again with the new version of orte. This commit was SVN r17672.	2008-02-29 22:28:57 +00:00
Ralph Castain	6450962d59	Add some debugging to the message event object. Cleanup some no-longer-used values This commit was SVN r17671.	2008-02-29 20:10:31 +00:00
Ralph Castain	a585923de1	Silence some minor compiler warnings This commit was SVN r17662.	2008-02-29 02:39:39 +00:00
Tim Prins	84b2099fe8	Remove the now-unused orte_value_array. As this is the last 'class' split between orte and ompi, remove the big comment about the split in ompi_bitmap. Also, update some properties (source files should not be executable...), and remove a couple unneeded inclusions of orte_proc_table.h This commit was SVN r17655.	2008-02-28 21:39:42 +00:00
Ralph Castain	5e6928d710	Cleanup recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message. Basically, the method employed here is to have a recv create a zero-time timer event that causes the event library to execute a function that processes the message once the recv returns. Thus, any action taken as a result of processing the message occurs outside of a recv. Created two new macros to assist: ORTE_MESSAGE_EVENT: creates the zero-time event, passing info in a new orte_message_event_t object ORTE_PROGRESSED_WAIT: while waiting for specified conditions, just calls progress so messages can be recv'd. Also fixed the failed_launch function as we no longer block in the orted callback function. Updated the error messages to reflect revision. No change in API to this function, but PLM "owners" may want to check their internal error messages to avoid duplication and excessive output. This has been tested on Mac, TM, and SLURM. This commit was SVN r17647.	2008-02-28 19:58:32 +00:00
Ralph Castain	5dc64cea6a	Correct logic - only issue recv and cancel it if we are an HNP This commit was SVN r17641.	2008-02-28 15:27:16 +00:00
George Bosilca	9d421bea2a	Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Gleb Natapov	da3e69101d	Add missing include. This commit was SVN r17493.	2008-02-18 14:55:02 +00:00
Galen Shipman	18d1d3b408	Add ORTE ALPS support (Cray XT CNL) This commit was SVN r17482.	2008-02-17 19:29:06 +00:00
George Bosilca	fcab6cc0bb	Fix typo. This commit was SVN r17255.	2008-01-26 21:36:04 +00:00
Rainer Keller	9d4852cdc1	- Get rid of Wshadow warnings. This commit was SVN r17231.	2008-01-25 14:07:38 +00:00
Pak Lui	413bcca4c0	Support the qrsh or qsub "-notify" option by catching the SIGUSR1/2 signals and not letting user processes exit on those signals. This commit was SVN r17174.	2008-01-22 17:32:29 +00:00
Josh Hursey	158dda5458	Fix some overlapping code. This commit was SVN r17067.	2008-01-08 15:40:21 +00:00
George Bosilca	eb71a634c6	Don't forget to initialize the msg_origin field. This commit was SVN r17055.	2008-01-04 23:24:49 +00:00
George Bosilca	48f5a26e8c	Cast to keep VC happy (quiet). This commit was SVN r17054.	2008-01-04 23:13:32 +00:00
Adrian Knoth	42d5fe62f9	Fixed misplaced #endif This commit was SVN r17028.	2008-01-01 11:02:38 +00:00
Jeff Squyres	213b5d5c6e	Per long threads on the mailing list and much confusion and discussion about linkers, have all OPAL, ORTE, and OMPI components '''not''' link against the OPAL, ORTE, or OMPI libraries. See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a better-formatted version of the same info). This commit was SVN r16968.	2007-12-15 13:32:02 +00:00
Josh Hursey	f7812baf5b	forgot a bit of error checking in the last commit This commit was SVN r16953.	2007-12-13 14:41:18 +00:00
Josh Hursey	a287c9cb65	This commit distinguishes the file transfer stage from the finish stage. This commit also cleans up the checkpoint and terminate case making it more precise than before. Previously the application could make a small amount of progress between checkpoint completion and application termination. Now the application will make no progress at all in this time span. Additional minor change: - Start using OPAL_INT_TO_BOOL instead of if/else logic This commit was SVN r16952.	2007-12-13 14:37:17 +00:00
Rolf vandeVaart	3ea89b69ae	Remove a few tabs. Allow the output stream to be passed to the close command for verbose output. This matches all the other frameworks. This commit was SVN r16938.	2007-12-11 20:44:56 +00:00
Josh Hursey	27c9016b93	sleep -> usleep so we can be a bit more eager when waiting for events to finish. Still working on solutions that do not involve sleeping, but this will do for now. This commit was SVN r16824.	2007-12-03 19:27:32 +00:00
Jeff Squyres	c20350b943	Patch submitted by Brian Barrett, inspired by this thread: http://www.open-mpi.org/community/lists/users/2007/11/4547.php. - Better handling of ECONNABORTED from connect on Linux. - Reduce extraneous output from OOB when TCP connections must be retried. This commit was SVN r16808.	2007-11-30 21:42:15 +00:00
Ron Brightwell	edb9d8e354	Added Catamount to the conditional compilation since Catamount doesn't support fork() or pipe() either. This removes a linker warning message when building for Cray XT with Catamount. This commit was SVN r16772.	2007-11-21 21:37:58 +00:00
George Bosilca	d67c0eefb4	Remove a compilation warning about using uninitialized variables. This commit was SVN r16589.	2007-10-26 20:15:28 +00:00
George Bosilca	b1b5cb6453	Looks like SO_REUSEPORT is not defined on some platforms. Switch to the conventional SO_REUSEADDR instead. This commit was SVN r16588.	2007-10-26 19:56:21 +00:00
George Bosilca	337f78a4a8	Restrict the port range for the OOB and the BTL. Each protocol (v4 and v6) has its own range, which is defined by a min value and a range. By default there is no limitation on the port range, which is exactly the same behavior as before. This commit was SVN r16584.	2007-10-26 16:36:51 +00:00
Jeff Squyres	9e4387d021	* Use new BEGIN_C_DECLS / END_C_DECLS convention * Add newline at end of file to avoid compiler warning This commit was SVN r16579.	2007-10-26 13:40:38 +00:00
Shiqing Fan	3c38c9c020	- Add extern "C" to resolve linkage specification problems. This commit was SVN r16577.	2007-10-26 09:54:42 +00:00
Ralph Castain	a791ce2299	The processor affinity must be set on a per-process basis, not per-app-context. This commit was SVN r16559.	2007-10-23 20:46:16 +00:00
George Bosilca	7a63f9b730	I somehow messed up my last commit. Sorry. This commit was SVN r16543.	2007-10-22 15:08:17 +00:00
George Bosilca	b93f72bdfd	Remove 2 warnings about uninitialized i and quit_flags. This commit was SVN r16542.	2007-10-22 15:01:15 +00:00
Jeff Squyres	5637c7a5a0	In addition to r16513, this commit fixes trac:1170. If we cannot resolve the route to the peer that we're trying to send to, don't queue up the message in the TCP OOB -- instead, return it to the upper layer (e.g., the RML) and let it decide what to do. In the case of the routed RML, the tree component will queue it up for later transmission. Hence, we don't want the message queued up both here in the TCP OOB and the tree routed. Also see some more discussion / explanation in #1171. This commit was SVN r16540. The following SVN revision numbers were found above: r16513 --> open-mpi/ompi@7ae9589d70 The following Trac tickets were found above: Ticket 1170 --> https://svn.open-mpi.org/trac/ompi/ticket/1170	2007-10-22 13:46:57 +00:00
Jeff Squyres	7ae9589d70	The header is at the address of the buffer pointed to by the iov, not the address of the iov. This commit was SVN r16513.	2007-10-19 12:40:14 +00:00
Jeff Squyres	abf1b728b9	Minor code maintenance fix -- put the THREAD_UNLOCK outside the if statement so that you only have to have it once. This commit was SVN r16512.	2007-10-19 12:36:26 +00:00
Ralph Castain	73eeb7f0d2	Fix a bug in the way we handled buffer releases and the conditioned wait that held us in the xcast until completed. This commit was SVN r16504.	2007-10-19 01:17:01 +00:00
Josh Hursey	0bf61a1b84	Move in some accumulated small features and minor bug fixes for C/R support. {{{ svn merge -r 16447:16475 https://svn.open-mpi.org/svn/ompi/tmp/jjh-fgs . }}} This commit was SVN r16478.	2007-10-17 13:47:36 +00:00
Ralph Castain	ec5fe78876	When in the unity message routing mode, we have to update the RML contact info in the parent procs so that they know how to talk to the children. Ideally, this would be done in the MPI layer since that layer knows which procs are actively involved in the comm_spawn. However, it isn't being done there, which causes comm_spawn to fail, so do it explicitly in the RTE. Note that this means ALL procs in the parent job are updated, even though they may not be participating in the comm_spawn. This doesn't really hurt anything - just unnecessary. Comm_spawn still has a problem when a child process shares a node with a parent, so this doesn't fix everything. It only fixes the bug of ensuring all procs know how to talk to each other. This commit was SVN r16460.	2007-10-16 16:09:41 +00:00
Ralph Castain	713b6e13a5	Improve diagnostic output messages when errors are hit This commit was SVN r16457.	2007-10-16 14:51:52 +00:00
Josh Hursey	ea0652d20f	If we are going to pretend to do filem, then we should always pretend. No one should be using this feature except for me. :) This commit was SVN r16454.	2007-10-15 20:04:35 +00:00
Ralph Castain	b6196e8a39	When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed. In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean. I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. Meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it. This commit was SVN r16451.	2007-10-15 18:00:30 +00:00
Jeff Squyres	423f23eb6a	Fixes trac:1160. There is still some other problem in the OOB, but we wanted to commit this to get wider testing. This commit was SVN r16445. The following Trac tickets were found above: Ticket 1160 --> https://svn.open-mpi.org/trac/ompi/ticket/1160	2007-10-15 15:41:36 +00:00
Josh Hursey	f16a42947a	Change some default MCA parameters: - Global snapshot directory = $HOME - FileM 'rsh' = 'ssh' - FileM 'rcp' = 'scp' This commit was SVN r16444.	2007-10-15 15:21:17 +00:00
Josh Hursey	520c27ac94	If the HNP is acting as the orted for local launch then the gpr_replica variable is not defined. Make sure to set it to something reasonable so that file preloading still works (instead of seg faulting :) Thanks to Hiep Bui Hoang for reporting this bug. This commit was SVN r16433.	2007-10-11 19:47:04 +00:00
Josh Hursey	e483c36cea	Remove a bit of debug in filem/rsh that should have never been committed. A gesture towards overlapping file removal with metadata update. This commit was SVN r16432.	2007-10-11 19:37:33 +00:00
Ralph Castain	3dbd4d9be7	Squeeeeeeze the launch message. This is the message sent to the daemons that provides all the data required for launching their local procs. In reorganizing the ODLS framework, I discovered that we were sending a significant amount of unnecessary and repeated data. This commit resolves this by: 1. taking advantage of the fact that we no longer create the launch message via a GPR trigger. In earlier times, we had the GPR create the launch message based on a subscription. In that mode of operation, we could not guarantee the order in which the data was stored in the message - hence, we had no choice but to parse the message in a loop that checked each value against a list of possible "keys" until the corresponding value was found. Now, however, we construct the message "by hand", so we know precisely what data is in each location in the message. Thus, we no longer need to send the character string "keys" for each data value any more. This represents a rather large savings in the message size - to give you an example, we typically would use a 30-char "key" for a 2-byte data value. As you can see, the overhead can become very large. 2. sending node-specific data only once. Again, because we used to construct the message via subscriptions that were done on a per-proc basis, the data for each node (e.g., the daemon's name, whether or not the node was oversubscribed) would be included in the data for each proc. Thus, the node-specific data was repeated for every proc. Now that we construct the message "by hand", there is no reason to do this any more. Instead, we can insert the data for a specific node only once, and then provide the per-proc data for that node. We therefore not only save all that extra data in the message, but we also only need to parse the per-node data once. The savings become significant at scale. 
Here is a comparison between the revised trunk and the trunk prior to this commit (all data was taken on odin, using openib, 64 nodes, unity message routing, tested with application consisting of mpi_init/mpi_barrier/mpi_finalize, all execution times given in seconds, all launch message sizes in bytes):
Per-node scaling, taken at 1ppn:
#nodes   original trunk     revised trunk
         time    size       time    size
1        0.10    819        0.09    564
2        0.14    1070       0.14    677
3        0.15    1321       0.14    790
4        0.15    1572       0.15    903
8        0.17    2576       0.20    1355
16       0.25    4584       0.21    2259
32       0.28    8600       0.27    4067
64       0.50    16632      0.39    7683
Per-proc scaling, taken at 64 nodes:
ppn      original trunk     revised trunk
         time    size       time    size
1        0.50    16669      0.40    7720
2        0.55    32733      0.54    11048
3        0.87    48797      0.81    14376
4        1.0     64861      0.85    17704
Condensing those numbers, it appears we gained:
per-node message size: 251 bytes/node -> 113 bytes/node
per-proc message size: 251 bytes/proc -> 52 bytes/proc
per-job message size: 568 bytes/job -> 399 bytes/job (job-specific data such as jobid, override oversubscribe flag, total #procs in job, total slots allocated)
The fact that the two pre-commit trunk numbers are the same confirms the fact that each proc was containing the node data as well. It isn't quite the 10x message reduction I had hoped to get, but it is significant and gives much better scaling. Note that the timing info was, as usual, pretty chaotic - the numbers cited here were typical across several runs taken after the initial one to avoid NFS file positioning influences. Also note that this commit removes the orte_process_info.vpid_start field and the handful of places that passed that useless value. By definition, all jobs start at vpid=0, so all we were doing is passing "0" around. In fact, many places simply hardwired it to "0" anyway rather than deal with it. This commit was SVN r16428.	2007-10-11 15:57:26 +00:00
Rolf vandeVaart	25c95c9ee9	Fix build on solaris. Need to include sys/wait.h. This commit was SVN r16426.	2007-10-11 15:04:30 +00:00
Jeff Squyres	e2df42eea3	Move the <sys/wait.h> below "orte_config.h" This commit was SVN r16424.	2007-10-11 11:31:09 +00:00
George Bosilca	7cc9f588a8	Decorate the base functions with ORTE_DECLSPEC. This commit was SVN r16423.	2007-10-11 00:02:49 +00:00
Ralph Castain	53af94fd87	Modify the configure system so that gridengine support is only built in specific conditions: 1. --with-sge, always builds 2. --without-sge, never builds 3. if neither is specified, build if and only if either SGE_ROOT is set or "qrsh" is found in the path This commit was SVN r16422.	2007-10-10 21:39:16 +00:00
Josh Hursey	6e5341c659	Forgot to move a header in the code movement. This commit was SVN r16420.	2007-10-10 15:39:40 +00:00
Ralph Castain	82a8e2d10d	Reorganize the odls framework to place common functionality in the base, thus making maintenance easier. We still need this to be a framework as some environments (e.g., bproc) require significantly different functionality. However, there is quite a bit of commonality across the components, so this ensures that fixes in one get propagated across the others. This patch also fixes a minor bug discovered along the way: we had "lost" the passing of the oversubscribed condition flag from the mapper to the orteds. Thus, we were not setting sched_yield correctly when in oversubscribed conditions (except when a hostfile was specified - different logic there because we treat the number of slots allocated on the node as "uncertain") I did not modify the process component in this patch - I will send a proposed patch to the maintainers of that component so they can review it first. This commit was SVN r16418.	2007-10-10 15:02:10 +00:00
Josh Hursey	7f833a9cb2	silence a warning that is triggered on restart This commit was SVN r16417.	2007-10-10 14:25:49 +00:00
Ethan Mallove	d0b61db65c	Add in a missing #include for Solaris builds. This commit was SVN r16416.	2007-10-10 12:49:15 +00:00
Josh Hursey	aa8391f888	Local and global coordinators should be the only ones involved in the movement of checkpoint files. This reduces the overhead on the application. This commit was SVN r16412.	2007-10-09 19:52:47 +00:00
Galen Shipman	fda1306807	revert my stupidity.. This commit was SVN r16410.	2007-10-09 19:01:20 +00:00
Josh Hursey	8fe2ef5647	a missing include This commit was SVN r16402.	2007-10-09 14:32:36 +00:00
Josh Hursey	7437f37e96	This commit contains the following:
* Fix some missing includes in a few places.
* Add the cr_request() functionality to the BLCR CRS component. We are now dependent upon the 0.6.* series of BLCR.
* Made the CR notification mechanism a registered function. This way we can have an OPAL-only version and it can be replaced at runtime with the ORTE version.
* Add a 'opal_cr_allow_opal_only' parameter that will enable OPAL-only CR functionality when the user wants it. Default: Disabled.
* Fix the placement of a checkpoint request check in MPI_Init
* Pull the OPAL notification mechanism into the SnapC framework.
* We no longer fork/exec the 'opal-checkpoint' command for local checkpointing, the Local coordinator in the orted does this directly.
* The Local and Application coordinator talk together bypassing the OPAL notification mechanism.
* Optimized the Local <-> App Coordinator communication.
* Improved the structure used to track vpid_snapshots in the local coord.
* Fix a race condition in which an application under heavy communication load may produce an inconsistent global checkpoint.
This commit was SVN r16389.	2007-10-08 20:53:02 +00:00
Galen Shipman	1c1b9d5480	make cray happy This commit was SVN r16377.	2007-10-08 14:31:59 +00:00
Ralph Castain	54b2cf747e	These changes were mostly captured in a prior RFC (except for #2 below) and are aimed specifically at improving startup performance and setting up the remaining modifications described in that RFC. The commit has been tested for C/R and Cray operations, and on Odin (SLURM, rsh) and RoadRunner (TM). I tried to update all environments, but obviously could not test them. I know that Windows needs some work, and have highlighted what is known to be needed in the odls process component.

This represents a lot of work by Brian, Tim P, Josh, and myself, with much advice from Jeff and others. For posterity, I have appended a copy of the email describing the work that was done:

As we have repeatedly noted, the modex operation in MPI_Init is the single greatest consumer of time during startup. To-date, we have executed that operation as an ORTE stage gate that held the process until a startup message containing all required modex (and OOB contact info - see #3 below) info could be sent to it. Each process would send its data to the HNP's registry, which assembled and sent the message when all processes had reported in.

In addition, ORTE had taken responsibility for monitoring process status as it progressed through a series of "stage gates". The process reported its status at each gate, and ORTE would then send a "release" message once all procs had reported in.

The incoming changes revamp these procedures in three ways:

1. eliminating the ORTE stage gate system and cleanly delineating responsibility between the OMPI and ORTE layers for MPI init/finalize. The modex stage gate (STG1) has been replaced by a collective operation in the modex itself that performs an allgather on the required modex info. The allgather is implemented using the orte_grpcomm framework since the BTL's are not active at that point. At the moment, the grpcomm framework only has a "basic" component analogous to OMPI's "basic" coll framework - I would recommend that the MPI team create additional, more advanced components to improve performance of this step.

The other stage gates have been replaced by orte_grpcomm barrier functions. We tried to use MPI barriers instead (since the BTL's are active at that point), but - as we discussed on the telecon - these are not currently true barriers so the job would hang when we fell through while messages were still in process. Note that the grpcomm barrier doesn't actually resolve that problem, but Brian has pointed out that we are unlikely to ever see it violated. Again, you might want to spend a little time on an advanced barrier algorithm as the one in "basic" is very simplistic.

Summarizing this change: ORTE no longer tracks process state nor has direct responsibility for synchronizing jobs. This is now done via collective operations within the MPI layer, albeit using ORTE collective communication services. I -strongly- urge the MPI team to implement advanced collective algorithms to improve the performance of this critical procedure.

2. reducing the volume of data exchanged during modex. Data in the modex consisted of the process name, the name of the node where that process is located (expressed as a string), plus a string representation of all contact info. The nodename was required in order for the modex to determine if the process was local or not - in addition, some people like to have it to print pretty error messages when a connection failed.

The size of this data has been reduced in three ways:

(a) reducing the size of the process name itself. The process name consisted of two 32-bit fields for the jobid and vpid. This is far larger than any current system, or system likely to exist in the near future, can support. Accordingly, the default size of these fields has been reduced to 16-bits, which means you can have 32k procs in each of 32k jobs. Since the daemons must have a vpid, and we require one daemon/node, this also restricts the default configuration to 32k nodes.

To support any future "mega-clusters", a configuration option --enable-jumbo-apps has been added. This option increases the jobid and vpid field sizes to 32-bits. Someday, if necessary, someone can add yet another option to increase them to 64-bits, I suppose.

(b) replacing the string nodename with an integer nodeid. Since we have one daemon/node, the nodeid corresponds to the local daemon's vpid. This replaces an often lengthy string with only 2 (or at most 4) bytes, a substantial reduction.

(c) when the mca param requesting that nodenames be sent to support pretty error messages is set, a second mca param is now used to request FQDN - otherwise, the domain name is stripped (by default) from the message to save space. If someone wants to combine those into a single param somehow (perhaps with an argument?), they are welcome to do so - I didn't want to alter what people are already using.

While these may seem like small savings, they actually amount to a significant impact when aggregated across the entire modex operation. Since every proc must receive the modex data regardless of the collective used to send it, just reducing the size of the process name removes nearly 400MBytes of communication from a 32k proc job (admittedly, much of this comm may occur in parallel). So it does add up pretty quickly.

3. routing RML messages to reduce connections. The default messaging system remains point-to-point - i.e., each proc opens a socket to every proc it communicates with and sends its messages directly. A new option uses the orteds as routers - i.e., each proc only opens a single socket to its local orted. All messages are sent from the proc to the orted, which forwards the message to the orted on the node where the intended recipient proc is located - that orted then forwards the message to its local proc (the recipient). This greatly reduces the connection storm we have encountered during startup.

It also has the benefit of removing the sharing of every proc's OOB contact with every other proc. The orted routing tables are populated during launch since every orted gets a map of where every proc is being placed. Each proc, therefore, only needs to know the contact info for its local daemon, which is passed in via the environment when the proc is fork/exec'd by the daemon. This alone removes ~50 bytes/process of communication that was in the current STG1 startup message - so for our 32k proc job, this saves us roughly 32k*50 = 1.6MBytes sent to 32k procs = 51GBytes of messaging.

Note that you can use the new routing method by specifying -mca routed tree - if you so desire. This mode will become the default at some point in the future.

There are a few minor additional changes in the commit that I'll just note in passing:

* propagation of command line mca params to the orteds - fixes ticket #1073. See note there for details.
* requiring of "finalize" prior to "exit" for MPI procs - fixes ticket #1144. See note there for details.
* cleanup of some stale header files

This commit was SVN r16364.	2007-10-05 19:48:23 +00:00
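The routed-messaging estimate in this message (about 50 bytes of OOB contact info per process, sent to every process) and the proc -> orted -> orted -> proc path can be sketched as below. This is an illustration only: the function names are hypothetical, "32k" is taken as 32,000 so that the result matches the quoted 51 GBytes, and the same-node shortcut in `routed_hops` is an assumption, since the message only describes the cross-node case.

```python
def stg1_contact_bytes(nprocs, bytes_per_proc=50):
    """Return (bytes per startup message, total bytes across all recipients)."""
    per_message = nprocs * bytes_per_proc  # contact info for every proc
    total = per_message * nprocs           # one such message per proc
    return per_message, total


def routed_hops(src, dst, node_of):
    """List the hops a message takes when the orteds act as routers.

    node_of maps a proc name to the node it runs on; each node has one orted.
    """
    src_daemon = ("orted", node_of[src])
    dst_daemon = ("orted", node_of[dst])
    if src_daemon == dst_daemon:
        # Assumption: a same-node message only passes through the local orted.
        return [src, src_daemon, dst]
    return [src, src_daemon, dst_daemon, dst]


per_message, total = stg1_contact_bytes(32_000)
print(per_message / 1e6, "MBytes per message,", total / 1e9, "GBytes in total")

placement = {"p0": 0, "p1": 0, "p2": 1}
print(routed_hops("p0", "p2", placement))
```

With routing enabled each proc holds a single socket to its local orted, so the per-proc contact list that drives the estimate above no longer needs to be distributed.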
Brian Barrett	3a0067249c	The previous hack to deal with Libtool not speaking Objective C stopped working with Automake 1.10. This is a new hack, which should be much more flexible. The ras doesn't contain any Objective C, so remove the hack entirely from that Makefile.am. This commit was SVN r16269.	2007-09-30 03:40:25 +00:00
Rolf vandeVaart	a87267ef92	Fix a build error on Solaris. MAXHOSTNAMELEN is defined in netdb.h. This commit was SVN r16268.	2007-09-28 20:15:28 +00:00
Josh Hursey	665a1e280b	Copyright updates that should have gone into r16252. (Someday I'll learn to do this before committing) This commit was SVN r16260. The following SVN revision numbers were found above: r16252 --> open-mpi/ompi@e10f476c87	2007-09-27 14:37:04 +00:00
Josh Hursey	e10f476c87	Bring over the jjh-filem branch which contains a non-blocking FileM interface and implementation. This has shown drastic performance benefit when transferring many files at roughly the same time. I tested this for many different filem operations and everything was working fine. Let me know if you have any problems with this functionality.
Some Notes:
- opal-checkpoint now has a 'quiet' flag to keep it from being too verbose.
- FileM RSH component is fully non-blocking.
- FileM RSH component has incoming connection throttling since by default ssh only allows 10 concurrent scp connections to any single host. This default can be adjusted via an MCA parameter. {{{-mca filem_rsh_max_incomming 10}}}
- There is an MCA parameter for max outgoing connections, but it is currently not implemented. If someone needs it then it should not be hard to implement. {{{-mca filem_rsh_max_outgoing 10}}}
- Changed the FileM request structure so that it is a bit more explicit and flexible.
- Moved the 'preload-binary' and 'preload-files' functionality into odls/base allowing for code reuse in the 'process' and 'default' ODLS components.
- Fixed a bug in the process name resolution which broke the 'preload-*' functionality due to GPR table structure changes.
- The FileM RSH component might be able to see even more speedup from using a thread pool to operate on the work_pool structures, but that is for future work.
- Added a 'opal-show-help' file to ODLS Base
This commit was SVN r16252.	2007-09-27 13:13:29 +00:00
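The incoming-connection throttling described in these notes caps how many transfers may target a host at once. The sketch below shows the general idea with a bounded semaphore and worker threads. That is a swapped-in technique for illustration: the FileM RSH component itself is non-blocking and is not claimed to work this way. The class and function names are hypothetical, and `fake_copy` stands in for an scp transfer.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class IncomingThrottle:
    """Allow at most max_incoming transfers to run at the same time."""

    def __init__(self, max_incoming=10):
        self._slots = threading.BoundedSemaphore(max_incoming)
        self._lock = threading.Lock()
        self.active = 0
        self.peak = 0  # highest concurrency observed, for inspection

    def run(self, transfer, *args):
        with self._slots:  # blocks while all slots are in use
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return transfer(*args)
            finally:
                with self._lock:
                    self.active -= 1


def fake_copy(name):
    """Stand-in for a file transfer; sleeps briefly and returns the name."""
    time.sleep(0.01)
    return name


def copy_all(names, max_incoming=10, workers=32):
    throttle = IncomingThrottle(max_incoming)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda n: throttle.run(fake_copy, n), names))
    return results, throttle.peak
```

Calling `copy_all([f"file{i}" for i in range(40)])` runs 40 simulated transfers on 32 worker threads while never letting more than 10 proceed concurrently.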
Josh Hursey	b5fc722c35	Add a flag to 'pretend' to do filem in snapc. This is useful when doing performance characterization, and should not be used by anyone doing anything else since it will not produce a globally consistent checkpoint in this mode. This commit was SVN r16192.	2007-09-24 16:19:45 +00:00
Jeff Squyres	f9b9beba77	Allow the LSF components to be shipped in the nightly tarball and open it up to others. This commit was SVN r16143.	2007-09-17 22:42:33 +00:00
Shiqing Fan	d4a7fb1378	- A small fix of format. This commit was SVN r16138.	2007-09-17 12:10:04 +00:00
George Bosilca	d32a54d74e	There is no values[1] ... How did the compilers get away with this?! This commit was SVN r16132.	2007-09-14 21:33:25 +00:00
George Bosilca	6897926dce	Not used anymore. This commit was SVN r16129.	2007-09-14 21:20:19 +00:00
Ralph Castain	45986ad2aa	Add support to signal application procs for LSF This commit was SVN r16120.	2007-09-13 18:09:14 +00:00
Ralph Castain	9fa254c017	Provide a better error message when a daemon unexpectedly dies under SLURM so we differentiate between fail to start and aborting while the app is running. This commit was SVN r16115.	2007-09-12 20:53:50 +00:00
Josh Hursey	b4c68c0925	Turn back on the absolute path protection for the moment. It is masking a bug that I'm tracking down in the SNAPC FULL - FILEM interactions. Also make sure to clean out the filem structure before asking for another checkpoint file when not storing the files in place. This commit was SVN r16109.	2007-09-12 18:19:39 +00:00
George Bosilca	e5d316dba6	Coverity: fix issues with using a string after it has been freed. The problem is that mca_base_register_string doesn't set the result to NULL if an error occurs. This commit was SVN r16108.	2007-09-12 18:16:53 +00:00
Ralph Castain	f80ea093a2	Ensure that the orteds do not directly respond to USR1/2 signals. Those signals are trapped by mpirun and propagated from there - at most, the orteds are involved in the propagation process, but should never do anything on their own. This commit was SVN r16098.	2007-09-12 14:32:31 +00:00
Shiqing Fan	548a4fe943	- Use IOVBASE_TYPE instead of char to avoid warnings on some systems. This commit was SVN r16092.	2007-09-11 16:24:23 +00:00
Shiqing Fan	c1065d8262	- Some more type casts. This commit was SVN r16087.	2007-09-11 11:28:43 +00:00
Brian Barrett	cfe737d1f9	Fix some mistaken error checks -- errors will be less than zero, not greater than zero This commit was SVN r16008.	2007-08-29 18:52:51 +00:00
Jeff Squyres	5628084fec	Fix Coverity CID 463: remove unused variable / dead code. This commit was SVN r15999.	2007-08-29 01:30:15 +00:00
Brian Barrett	dcf678dbab	Fix heterogeneous issue with non-blocking RML receive, where the sender field could be in the wrong endianness This commit was SVN r15989.	2007-08-28 20:54:52 +00:00
Josh Hursey	729c63cf9d	Fix invalid MCA 'base' names so they appear in ompi_info. A subset of this patch needs to be applied to v1.2 Refs trac:928 This commit was SVN r15918. The following Trac tickets were found above: Ticket 928 --> https://svn.open-mpi.org/trac/ompi/ticket/928	2007-08-18 03:05:45 +00:00
Brian Barrett	8294f6de03	The portals_utcp component doesn't actually need the Portals libraries and only pokes at environment variables. So don't link in the libraries, as that causes a whole other set of problems. This commit was SVN r15899.	2007-08-17 03:48:39 +00:00
Andrew Friedley	2eedcd2539	Fixes trac:1047 Tie stdin to /dev/null to prevent stdin from being closed and thus making stdin not work in slurm allocations. This commit was SVN r15892. The following Trac tickets were found above: Ticket 1047 --> https://svn.open-mpi.org/trac/ompi/ticket/1047	2007-08-16 20:49:27 +00:00
Tim Prins	5a795128af	Change it so that different components in orte use unique rml tags This commit was SVN r15881.	2007-08-16 14:02:35 +00:00
Brian Barrett	330003361b	* Free memory from asprintf * need to compare ERANGE to errno This commit was SVN r15860.	2007-08-14 21:12:00 +00:00
Brian Barrett	881dd0654e	* Provide a hook so that a PLS can tell the orted it's starting that it needs to override the default umask. By default, this is not used since most environments do what the user would expect without any help. * Have TM use the newly added umask hook, so that processes inherit the user's umask from mpirun rather than the pbs_mom's umask, which the user has no control over. This commit was SVN r15858.	2007-08-14 18:44:52 +00:00
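The umask hook described here lets the launcher hand the user's umask to the daemon it starts, so that processes do not inherit the resource manager's umask. A minimal sketch of that idea follows. Passing the value through an environment variable and the name `SKETCH_UMASK` are assumptions made for illustration; the commit does not say how the PLS passes the value to the orted.

```python
import os

UMASK_VAR = "SKETCH_UMASK"  # hypothetical variable name, not an Open MPI setting


def current_umask():
    """Read the process umask without changing it."""
    mask = os.umask(0)  # os.umask returns the previous value
    os.umask(mask)      # restore it immediately
    return mask


def export_umask(env, mask):
    """Launcher side: return a copy of env carrying the umask as octal text."""
    child_env = dict(env)
    child_env[UMASK_VAR] = format(mask, "03o")
    return child_env


def apply_umask_override(env):
    """Daemon side: apply the override if present, else leave the umask alone."""
    value = env.get(UMASK_VAR)
    if value is None:
        return None  # default: no hook used, keep the inherited umask
    mask = int(value, 8)
    os.umask(mask)
    return mask
```

A launcher would call `export_umask(os.environ, current_umask())` before starting the daemon, and the daemon would call `apply_umask_override(os.environ)` before forking application processes.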
Shiqing Fan	eea712f9ab	- Export those components in correct way. This commit was SVN r15804.	2007-08-08 16:20:17 +00:00
Brian Barrett	59524a9009	Fix issue where we set state to SHUTDOWN rather than CONNECTING when we had to switch socket types. This commit was SVN r15784.	2007-08-06 22:55:41 +00:00
Ralph Castain	eb3a97f428	Don't overwrite the local rank key This commit was SVN r15776.	2007-08-06 16:56:23 +00:00
Shiqing Fan	d10570786c	- A small fix, add missed flag parameters. This commit was SVN r15774.	2007-08-06 16:15:38 +00:00
Josh Hursey	755658694e	Bring in changes to support Cray's Compute Node Linux (CNL) and Application Level Placement Scheduler (ALPS). This commit was tested under two Cray machines at ORNL: Jaguar (Catamount) and Rizzo (CNL Test cage). Both machines performed as they should across the commit. It is likely that more changes will follow as the work and environment stabilize.

Most of the infrastructure works the same for Catamount and CNL except for a few bits. Below are the highlights:

Default IFACE Change: On Catamount we can use PTL_IFACE_DEFAULT, but the CNL system we have access to will fail on this interface, and should be set to: IFACE_FROM_BRIDGE_AND_NALID(PTL_BRIDGE_UK,PTL_IFACE_SS). So if we detect that we are running with YOD then use the former interface and if we detect that we are running with ALPS then use the latter. We will want to pursue a more elegant solution if this interface continues to change across machines.

PtlGetId and cnos_register_ptlid: The header suggests that these should never be called when launching with YOD. But in the ALPS environment the cnos_barrier() will hang forever if these functions are not called after PtlNIInit(). Since these functions only need to be called once, and the orte rmgr/cnos component is loaded before the ompi common/portals component, just call these functions once in the rmgr/cnos component.

cnos_barrier_init(): This is a noop for YOD, but critical for ALPS. So be sure to call it before calling the first barrier in the rmgr/cnos component.

cnos_barrier vs cnos_pm_barrier: It is suggested the cnos_pm_barrier only be used during finalization as it will indicate to the launcher (yod or aprun) that the app is about to complete. It was suggested that we use the regular cnos_barrier() instead. I want to look into this a bit more to make sure there are not adverse side effects. A note has been placed in the code to indicate this reasoning.

This commit was SVN r15756.	2007-08-03 19:46:38 +00:00
Jeff Squyres	106beff744	Ahem. Apparently we should be checking for ORTE_EQUAL upon return from orte_ns.compare_fields(), not 0 (yes, they're the same [today], but it is much better to check for symbolic names...). This commit was SVN r15731.	2007-08-01 18:59:37 +00:00
Jeff Squyres	8d4b6c7b0d	The HNP changing into an orted brought a bug in the iof svc component to light: we weren't ack'ing properly for streams that originated (or originated via proxy) and terminated within the HNP. This commit fixes that. It also fixes a few style issues, and added some more opal_outputs for debugging. Also, fixed a bug where the fact that we forwarded (and therefore might need to update the ack) was not correctly reported if there were multiple forwards (which there are not as the system is currently using IOF, but there could be). Refs trac:1098 -- want to get another pair of eyes to look at this before I close the ticket. This commit was SVN r15730. The following Trac tickets were found above: Ticket 1098 --> https://svn.open-mpi.org/trac/ompi/ticket/1098	2007-08-01 18:38:03 +00:00
Ralph Castain	066ff38d42	Ensure we read all the reported URI contact info when we fork an HNP for singleton support This commit was SVN r15714.	2007-07-31 18:55:08 +00:00
George Bosilca	2e2bf472ff	Mark the orte_abort function as noreturn and change the return value from int to void. This function calls exit at the end, so there is no way to return from there. Apply the same thing to the errmsg_abort function and update all components. This commit was SVN r15704.	2007-07-31 16:09:52 +00:00
Sven Stork	855434de59	- fixes several Coverity issues - add missing initialisation for variables - use strncpy instead of strcpy This commit was SVN r15683.	2007-07-30 14:44:37 +00:00
Rainer Keller	2c5d07217d	- Coverity: use snprintf, instead of sprintf.... This commit was SVN r15669.	2007-07-29 11:23:23 +00:00
Jeff Squyres	3858cf48c0	Stop using the deprecated ORTE_NAME_ARGS() and switch to ORTE_NAME_PRINT(). This commit was SVN r15665.	2007-07-27 13:33:20 +00:00
Josh Hursey	acbc8ecca3	- On Cray XT systems stop the grpcomm basic component from building. grpcomm cnos component - Remove the .ompi_ignore - add a configure.m4 that should keep it from building on any system other than Cray XT* (copied from rml/cnos) - Fix some mis-named symbols resulting from cut/paste errors. This patch brings the Cray build back into 'working' order. This commit was SVN r15651.	2007-07-26 20:42:06 +00:00
Jeff Squyres	188d529beb	* We do need the LSF task ID as part of our vpid * Accidentally had the PLS LSF using the env SDS; switch it back to the LSF SDS This commit was SVN r15650.	2007-07-26 20:22:36 +00:00
Josh Hursey	e5a03e7734	- Remove Makefile.in from version control - Add back support for cnos (copy functionality lost by moving the interface from the RML). - Fix some cut/paste errors. This commit was SVN r15646.	2007-07-26 18:52:17 +00:00
Jeff Squyres	75192de1fc	LSF support is now working. W00t! May be subject to a further tweak or two. * checking lsb_init() is not sufficient to know whether you're in an LSF job or not; you also need to check for environment variable markers * remove lots of debugging output * no need for the sds lsf to call lsb_init() * remove some slurm-like dead code and a copy-n-paste error in the sds lsf This commit was SVN r15644.	2007-07-26 18:49:29 +00:00
Jeff Squyres	8e9c71282d	Add a bunch more [conditional] debugging output. This commit was SVN r15643.	2007-07-26 18:46:46 +00:00


1279 commits