openmpi

Author	SHA1	Message	Date
George Bosilca	9b16827049	Add ORTE_DECLSPEC, and few conversions. This commit was SVN r13268.	2007-01-24 00:52:08 +00:00
George Bosilca	1e38810c2d	Correctly close the sockets in a generic way. This commit was SVN r13254.	2007-01-23 03:17:23 +00:00
Ralph Castain	46b7df5683	Fix bproc nodename to correctly assign process-to-nodename mapping This commit was SVN r13244.	2007-01-22 19:39:00 +00:00
Jeff Squyres	6584df9262	For --prefix-like behavior, we used to modify environ directly and then exec the "srun..." from there. But somewhere along the line, we switched to having a copy of environ and modifying that. It looks like we forgot to update the stuff for --prefix behavior. So this commit fixes the setenv's for PATH and LD_LIBRARY_PATH to modify the environ copy (not environ itself) so that the values properly get passed down to the srun environment via execve(). This restores --prefix behavior in the SLURM pls. This commit was SVN r13239.	2007-01-22 15:50:35 +00:00
George Bosilca	1b92589179	Update the PLS process. This commit was SVN r13236.	2007-01-22 05:48:25 +00:00
Rainer Keller	c2e9075d29	- Define a OPAL_CLASS_EMPTY to be used for initialization. Similar within the dt_module for the predefined datatypes. This commit was SVN r13228.	2007-01-21 15:52:06 +00:00
Ralph Castain	b63d4ddfbf	Clean up a compiler warning under bproc This commit was SVN r13198.	2007-01-18 20:07:06 +00:00
Ralph Castain	1487e22ec8	Store the mapping mode so that it can be recovered later This commit was SVN r13197.	2007-01-18 20:00:15 +00:00
Ralph Castain	455e4ada9a	Bring the modified/updated pernode and npernode behaviors over from the openrte repository. This change enables npernode to pay attention to the total #procs to be launched, and cleans up the bynode vs. byslot mapping directives when in pernode and npernode modes. This commit was SVN r13191.	2007-01-18 17:15:19 +00:00
Brian Barrett	ffe35ef6b8	Update the Cray XT3 run-time support files to compile with latest RTE changes This commit was SVN r13172.	2007-01-17 22:47:27 +00:00
Ralph Castain	e093b5a256	Check for NULL before release, just to be safe This commit was SVN r13162.	2007-01-17 21:29:34 +00:00
Jeff Squyres	6f7adfe231	Fix for the oob base open and close functions being invoked twice by ompi_info -- once directly and once via the rml oob component. This commit was SVN r13152.	2007-01-17 15:18:13 +00:00
Jeff Squyres	3983141342	Remove this extra #if -- it wasn't necessary and was causing compiler warnings. This commit was SVN r13146.	2007-01-17 13:53:02 +00:00
Ralph Castain	f08210b3e1	Fix a double-free error This commit was SVN r13126.	2007-01-16 15:49:16 +00:00
George Bosilca	0f68bb0bc0	Add a debug flag for the ODLS process. This commit was SVN r13075.	2007-01-11 00:17:34 +00:00
George Bosilca	c8222b57eb	The Windows PLS is now able to spawn processes locally. This commit was SVN r13074.	2007-01-11 00:16:58 +00:00
Rolf vandeVaart	9fd5e55b50	Need to include strings.h because that is where the rindex() function prototype lives. Without this, we get compile warnings. In addition, for 64-bit Solaris, we get a segmentation fault from orterun without this include. This commit was SVN r13065.	2007-01-10 18:44:08 +00:00
Brian Barrett	03112254e7	Increase connection timeout to 600 seconds, which should always be higher than the connect() timeout, so that we'll use that rather than our own timeout by default. The timeout was set low for Big Red, but causes problems for very large clusters, as there's no way to wire them up in 10 seconds most of the time. This commit was SVN r13062.	2007-01-10 04:53:21 +00:00
George Bosilca	c7da2b0a9a	Add the process PLS. It's only intended for Windows users. This commit was SVN r13051.	2007-01-09 00:19:52 +00:00
George Bosilca	77452ea8ea	Add a missing include and update the definition of orte rds proxy component. This commit was SVN r13042.	2007-01-08 22:00:01 +00:00
Brian Barrett	e130f18cc2	Fix some compiler warnings that have slipped in lately... This commit was SVN r13037.	2007-01-08 17:20:09 +00:00
Brian Barrett	a34e67d743	Remove unneeded PARAM_INIT_FILE variable in configure.params files used by components that use configure.m4 for configuration or are always built. The macro has not been needed since moving to configure types other than configure.stub Fixes trac:590 This commit was SVN r13031. The following Trac tickets were found above: Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590	2007-01-08 03:44:22 +00:00
Jeff Squyres	a91c017f81	The constant name changed from ORTE_RML_NAME_ANY to ORTE_NAME_WILDCARD -- update the comments/documentation to match. This commit was SVN r13001.	2007-01-05 13:38:22 +00:00
Ralph Castain	3ce9b2f6cc	Remove some debugging output that mistakenly was left behind. This commit was SVN r12984.	2007-01-04 17:24:11 +00:00
Ralph Castain	6101050ea6	Remove an abstraction barrier I thought was gone long-ago. The OOB subscription really shouldn't be defined as an OMPI subscription. I know it's just a technicality, but it is time to address such things rather than just letting them continue to propagate. :-) This commit was SVN r12954.	2007-01-02 16:16:50 +00:00
Ralph Castain	90f5e3fad8	Fix a buglet in the singleton startup procedure. For purposes of minimizing the xcast message, we "strip" the descriptive info on all subscription messages. This means, though, that we have to store the process name and other info so it can be retrieved in the body of the subscription data (as opposed to in the description). This wasn't being done for singletons because they don't call the RMAPS to "map" themselves. This has now been corrected. The singleton startup will dutifully call the mapper framework so that the proper data storage locations get initialized. Unfortunately, we then had to instruct the RMAPS not to allocate a vpid range for this job - otherwise, it would make a mistake and think there were two processes in it. Hence, a change was required to RMAPS to tell it "map this job, but don't allocate a vpid range for it". This change will need to migrate across to 1.2 after it "soaks" the appropriate time. This commit was SVN r12952.	2007-01-02 16:14:44 +00:00
Rich Graham	6cb2377015	Change the allocation of the shared memory backing file. The file is allocated on a per comm_world instance, with the lowest rank in comm_world on the given host creating and initializing the file, and then notifying the remaining files via the OOB. Reviewed: Ralph Castain, Brian Barrett Addressing ticket #674. This commit was SVN r12949.	2007-01-01 02:39:02 +00:00
Brian Barrett	b5057d923e	fix compile error that crept in with orte changes. This just makes it compile -- I can't check for correctness on this platform. Someone probably wants to do that. This commit was SVN r12948.	2006-12-31 20:16:20 +00:00
Li-Ta Lo	6df4e80727	new XCPU PLS and SDS to work with libxcpu This commit was SVN r12905.	2006-12-21 00:05:36 +00:00
Brian Barrett	414a87b14c	On OS X, the lowest 8 bits of the exit status are for signals, so pushing rc (which is -1 or 4 if we hit this case) resulted in an odd error that a signal killed the proc (instead of a startup error, as is reality). Instead, use the W_EXITCODE macro (if available) to build up an exit code that has an error code for exit status, but does not make it look like the process died from a signal This commit was SVN r12890.	2006-12-18 02:30:05 +00:00
Ralph Castain	a0ef517550	Fix some errors in the bproc components that prevented compiling. Thought I had already done this, but either those changes were lost when I did the merge, or my old man's memory is fading.... Whaz-at??? :-) This commit was SVN r12874.	2006-12-15 19:40:04 +00:00
Ralph Castain	1e1d0e8a89	Set the app_num attribute into the process environment so we pick it up on the other end This commit was SVN r12868.	2006-12-15 16:43:52 +00:00
Ralph Castain	677d1260aa	cleanup nicely if we don't launch This commit was SVN r12867.	2006-12-15 14:03:53 +00:00
Ralph Castain	cbb660504c	Retain the ability to run valgrind on the bproc launcher - do not call bproc_version if "nolaunch" is specified. This commit was SVN r12866.	2006-12-15 14:01:21 +00:00
Ralph Castain	64ec238b7b	Repair support for Bproc 4 on 64-bit systems. Update the SMR framework to actually support the begin_monitoring API. Implement the get/set_node_state APIs. This commit was SVN r12864.	2006-12-15 02:34:14 +00:00
Brian Barrett	38c2e43ac2	Print out error string rather than errno for TCP-related errors, making it easier for both the user and us to debug issues with BTL and OOB issues... This commit was SVN r12852.	2006-12-14 18:20:43 +00:00
Ralph Castain	3b064a624e	For convenience, revise the orte_job_map_t object so it includes the vpid start/range values, the number of nodes, and the number of processes on each node. These values are all used in various places in the code base - we currently re-compute them multiple times. Since these values do not change and are already being computed by the RMAPS framework, we might as well just save them for re-use. This commit was SVN r12829.	2006-12-12 16:07:23 +00:00
Ralph Castain	28ce8e5e5e	Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior). In "--npernode" operation, the "-np" command line parameter is ignored. This commit was SVN r12826.	2006-12-12 00:54:05 +00:00
Ralph Castain	8314e8dbb9	Modify the pernode option so it can accept a request for the number of processes to be launched. We now check three use-cases for pernode: 1. no -np provided - put one proc/node across all allocated nodes 2. -np N provided, N > #nodes - we print a pretty error message and exit 3. -np N provided, N <= #nodes - put one proc/node across N nodes I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error. This commit was SVN r12821.	2006-12-11 18:07:07 +00:00
Ralph Castain	0a5d41857a	Complete next round of message size reduction: "strip" the descriptive info from the returned values. I have now added a flag to the gpr address mode (ORTE_GPR_STRIPPED) that instructs the gpr to not include segment names or tokens in the returned gpr_value_t objects. I found only two places that were looking at the tokens: 1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message. 2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was always the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "stripped" the value structures of tokens and segment strings, and (c) correctly obtained process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes). This commit was SVN r12813.	2006-12-09 23:10:25 +00:00
Ralph Castain	58569546ed	Fix the fix to remove compiler warning - an incorrect "\" was placed in the command string. This commit was SVN r12805.	2006-12-08 04:17:38 +00:00
Sven Stork	78173a697a	Replace the test operation "-e" with "-r" to improve the portability. Refs: #392 This commit was SVN r12790.	2006-12-07 12:14:40 +00:00
Ralph Castain	62d7826e01	Helps if we total up the correct field to get the total number of slots in the universe This commit was SVN r12789.	2006-12-07 03:17:12 +00:00
Ralph Castain	a1153fdc8f	Eliminate virtually all of the attribute_predefined data from the STG1 message. We now compute the total number of slots allocated to us and save that in the registry - the attributed_predefined then retrieves it via the STG1 message. The app_num is passed via the process_info structure, which gets the value from the ODLS in the environment. Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available. This commit was SVN r12788.	2006-12-07 03:11:20 +00:00
Ralph Castain	d4bd60c9fe	Restore the paffinity capability, along with all the required logic to ensure we "do the right thing" when the user gives us inaccurate information about the number of slots on a remote node. This commit was SVN r12780.	2006-12-06 15:59:34 +00:00
Ralph Castain	b1e16fffac	Add the C++ doo-hicky stuff around the odls framework definitions just in case somebody, somewhere, on some remote planet where only goats can feed needs it. This commit was SVN r12777.	2006-12-06 13:58:04 +00:00
Ralph Castain	8ca415a0c5	Remove duplicate orte_odls declaration This commit was SVN r12776.	2006-12-06 13:44:41 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
Tim Prins	08d5ca821f	Don't get the node architecture when using the LoadLeveler RAS. It is slow (about a second for ~300 nodes) and we don't even use the value. This commit was SVN r12758.	2006-12-05 13:47:53 +00:00
Ralph Castain	eb941d8ae2	Fix a bug that declared a node as "oversubscribed" a little early during the mapper procedure. This only affected the mapping procedure, and only if you had set the "--no-oversubscribe" flag. Kudos to Tim Prins for finding it. This commit was SVN r12757.	2006-12-05 13:04:27 +00:00
Gleb Natapov	f0132b2499	Provide parameters in a correct order (processor/oversubscribed was swapped). This commit was SVN r12737.	2006-12-04 12:55:45 +00:00
George Bosilca	a0ed53d70b	Make the compilers happy. This commit was SVN r12729.	2006-12-03 00:19:11 +00:00
George Bosilca	3fd278c522	Make the tree compile in debug mode. This commit was SVN r12724.	2006-12-01 23:03:09 +00:00
Ralph Castain	897744cdeb	Two major changes to the runtime: 1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1". 2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomial tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial". 3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1". 4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer) This commit was SVN r12722.	2006-12-01 22:30:39 +00:00
Jeff Squyres	3cf7dddd47	Fixes trac:635. Ralph identified the problem, I tracked down ''where'' the fd was being closed, and Brian figured out ''why'' (and the fix). What was happening is that a remote process was closing its stdout/stderr and therefore sending a 0-byte IOF message to mpirun. mpirun, in turn, closed the iof endpoint associated with that stream (i.e., stdout/stderr). IOF does this to handle the case where mpirun's stdin is closed -- this therefore causes the stdin on all the ORTE-started processes to have their stdin's closed as well. So the workaround here is to check that if we get a 0-byte IOF message on a sink (indicating a remote closure), and if that sink is the special stdout or stderr stream, don't actually close anything in the local process. This commit was SVN r12691. The following Trac tickets were found above: Ticket 635 --> https://svn.open-mpi.org/trac/ompi/ticket/635	2006-11-28 21:42:49 +00:00
Ralph Castain	0398c9e0c5	Correctly setup the sched_yield when launching processes via the orteds. This still doesn't adjust the yield schedule "on-the-fly" as more procs are dynamically added to a node - it just sets it when they are first launched. This commit was SVN r12683.	2006-11-28 08:27:20 +00:00
George Bosilca	8df8d86b85	Complete the functions to match the expected prototype. This commit was SVN r12680.	2006-11-28 00:44:30 +00:00
Ralph Castain	bc4e97a435	First stage in the move to a faster startup. Change the ORTE stage gate xcast into a binary tree broadcast (away from a linear broadcast). Also, removed the timing report in the gpr_proxy component that printed out the number of bytes in the compound command message as the answer was "not much" - reduces the clutter in the data. This commit was SVN r12679.	2006-11-28 00:06:25 +00:00
Ralph Castain	9bc25f0bec	Fix a potential bug in the registry where it didn't fully check a segment's name when searching for it. Will have to verify that this doesn't break other things. Bring the bproc system close to being back online.... This commit was SVN r12659.	2006-11-23 04:17:37 +00:00
Ralph Castain	deb2470ba3	Move the waitpid callback in the bproc pls after we store the daemon info. Otherwise, a short-lived app could terminate before we store the daemon info, causing mpirun to not terminate the daemons since the call to get_active_daemons would return a NULL list. This commit was SVN r12656.	2006-11-22 22:49:22 +00:00
Ralph Castain	b1ff5fe868	Move the name of the bproc common segment to the central schema location - avoids conflicts when bproc 3 components try to build This commit was SVN r12654.	2006-11-22 20:23:17 +00:00
Ralph Castain	8080034eb2	Clean up a compile issue for bproc This commit was SVN r12653.	2006-11-22 19:50:27 +00:00
Ralph Castain	428c1f14c3	Modify the bproc components to resolve the current allocation problem This commit was SVN r12652.	2006-11-22 19:10:58 +00:00
Sven Stork	dc116d4814	- Add missing mutex lock This commit was SVN r12649.	2006-11-22 13:37:58 +00:00
Ralph Castain	6fca1431f3	Back out some prior commits. These commits fixed bproc so it would run, but broke several other things (singleton comm_spawn and hostfile operations have been identified so far). Since bproc is the culprit here, let's leave bproc broken for now - I'll work on a fix for that environment that doesn't impact everything else. This commit was SVN r12648.	2006-11-22 13:30:21 +00:00
Brian Barrett	0895f5e08d	Rename OMPI_PROCESS_NAME_{HTON, NTOH} macros to ORTE_PROCESS_NAME_{HTON, NTOH} because they are in ORTE, not OMPI. Also, remove the ORTE_PROCESS_NAME macros in iof base as they are duplicates of the ones that were in ns_types, which meant that bad things happened if you changed what an orte_process_name_t looked like. This commit was SVN r12646.	2006-11-22 03:03:21 +00:00
Ralph Castain	761d4fab25	cleanup unused variables This commit was SVN r12643.	2006-11-21 20:51:39 +00:00
Ralph Castain	a30c65ca24	Fix the allocator to make bproc happy. We were burned again by the fact that the bproc state monitor creates entries on the node segment for all the nodes in the cluster when it is opened during orte_init. As a result, the bjs allocator was never being called, and the system merrily assumed that all nodes in the cluster had been allocated to it. To fix this, I removed a test that had been inserted into the allocation procedure that checked for a non-zero node segment. This was an old artifact - the RAS components already know that they are not to overwrite any existing node segment entries (at least, bproc does - I will check the others. For now, I just want to save the bproc fix on this machine). This commit was SVN r12640.	2006-11-21 19:52:55 +00:00
Ralph Castain	050e401671	Simplify - the cellid is simply a field in the process name. Recently, we decided to just directly access it and get rid of extraneous function calls. This commit was SVN r12638.	2006-11-21 09:59:24 +00:00
Tim Prins	2afb401e39	fix some compile warnings This commit was SVN r12636.	2006-11-21 00:33:10 +00:00
Galen Shipman	c06b740220	a few more opps to get the cellid... This fixes a compilation error on bproc machines.. This commit was SVN r12631.	2006-11-20 19:45:01 +00:00
Ralph Castain	33affed09c	Bring ortehalt to a preliminary capability. It will correctly order a persistent daemon to exit cleanly. Need to now interface it to orterun, clean up a few things here and there This commit was SVN r12626.	2006-11-18 04:47:51 +00:00
Ralph Castain	ea1e0d34c8	Fix the SLURM PLS and SDS components so they correctly communicate the name of the node. This repairs a number of comm_spawn issues seen on Odin and other SLURM machines. This commit was SVN r12621.	2006-11-17 20:51:03 +00:00
Ralph Castain	3c5a2cd17b	Cleanup a few warnings for unused variables This commit was SVN r12620.	2006-11-17 19:32:49 +00:00
Ralph Castain	f771cc4fbd	Modify the reuse daemons procedure so we only generate the add_local_procs message once. Revise the display-map-at-launch option so the RMAPS framework takes responsibility for implementation of that option. Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check made for HNP/non-HNP operation made in the map_jobs function prior to execution. This commit was SVN r12619.	2006-11-17 19:06:10 +00:00
Ralph Castain	ca5b4358fa	Need to revise the display-map-at-launch option so it is active not only for the initial launch, but applies to any subsequent comm_spawn events too. Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone. This commit was SVN r12616.	2006-11-17 02:58:46 +00:00
Ralph Castain	5ddcb8a652	Ensure the orted kills all local procs when exiting. Add a little clarity to some of the debugging output This commit was SVN r12615.	2006-11-16 21:15:25 +00:00
Ralph Castain	c1813e5c5a	Extend the daemon reuse functionality to most of the other environments. Note that Bproc won't support this operation, so we just ignore the --reuse-daemons directive. I'm afraid I don't understand the POE and XGrid environments well enough to attempt the necessary modifications. Also, please note that XGrid support has been broken on the trunk. I don't understand the code syntax well enough to make the required changes to that PLS component, so it won't compile at the moment. I'm hoping Brian has a few minutes to fix it after SC. This commit was SVN r12614.	2006-11-16 15:11:45 +00:00
Ralph Castain	a17e27dfd5	Sign of old age.....fix some compiler complaints This commit was SVN r12611.	2006-11-15 22:28:14 +00:00
Ralph Castain	f7fc19a2ca	Create the ability to re-use existing daemons. Included in the commit: 1. new functionality in the pls base to check for reusable daemons and launch upon them 2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED 3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse". This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on. This commit was SVN r12608.	2006-11-15 21:12:27 +00:00
Andrew Friedley	3f550c3a83	missing a semicolon... This commit was SVN r12607.	2006-11-15 20:35:27 +00:00
Ralph Castain	437f2b044d	Modify the orted command communication system in two ways: 1. use non-blocking sends to transmit commands (this was actually done in a prior commit) 2. have an "ack" message sent back from the orted when it completes the command The latter item is the new one here. With my prior commit, it was possible for the HNP to move on to other things before the orted had completed its command. This caused the HNP to occasionally exit before the orted, thus generating "lost connection" errors. With this change, we retain the parallel nature of the command communications, but still hold the HNP at that point until the orteds are done. Best of both worlds. This commit was SVN r12605.	2006-11-15 15:09:28 +00:00
Ralph Castain	22a473a831	Fix a compiler complaint for too many arguments to a printf command - thanks to Tim Prins for pointing it out to me. This commit was SVN r12604.	2006-11-15 14:52:48 +00:00
Ralph Castain	6d6cebb4a7	Bring over the update to terminate orteds that are generated by a dynamic spawn such as comm_spawn. This introduces the concept of a job "family" - i.e., jobs that have a parent/child relationship. Comm_spawn'ed jobs have a parent (the one that spawned them). We track that relationship throughout the lineage - i.e., if a comm_spawned job in turn calls comm_spawn, then it has a parent (the one that spawned it) and a "root" job (the original job that started things). Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it. I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn). This commit was SVN r12597.	2006-11-14 19:34:59 +00:00
Ralph Castain	f95e20e2e1	Add another test program - an MPI app that just spins. This supports testing of system response to signal-terminated processes. Add some debugger output to the ODLS default component. Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time. This commit was SVN r12585.	2006-11-13 21:51:34 +00:00
Tim Prins	39bc652899	Refs trac:612 Make it so if -np was not passed and -pernode was, we map bynode This commit was SVN r12580. The following Trac tickets were found above: Ticket 612 --> https://svn.open-mpi.org/trac/ompi/ticket/612	2006-11-13 19:13:21 +00:00
Ralph Castain	4636125e2d	Modify the RMGR components to allow job setup with a given jobid, and add another attribute so that we can setup triggers without launching. Add some debugging output to the ODLS default module, and the orted. Remove the nodename data from the ODLS info report - that info is already stored in the registry by the RMAPS framework upon completing the mapping procedure. Add another test program that does an ORTE-only dynamic spawn (gasp!). Looks just like comm_spawn - just no MPI involved. Modify the ODLS to release the processor when we "kill" local procs in a more scalable fashion. It previously had a sleep in it that Jeff's prior commit removed. However, he introduced some Windows code into the non-Windows component (protected by "if"s, but unnecessary). This is a more general solution he proposed - included here so I could get things to compile properly. This commit was SVN r12579.	2006-11-13 18:51:18 +00:00
Jeff Squyres	bfdf801487	Replace a "sleep(1)" with a yield so that the orted can reap processes much faster. This commit was SVN r12575.	2006-11-13 12:45:03 +00:00
Ralph Castain	4e50cdae52	This commit accomplishes two things: 1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had some difficulty with the process actually not having been started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs. 2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system. This commit was SVN r12558.	2006-11-11 04:03:45 +00:00
Sven Stork	9ba4c4a7ee	- Add support for "sh" handling. Instead of detecting bash we now check for bourne shell, because bourne shell is the lowest common denominator for bash/ksh/sh. - Make some shell expressions sh compatible This commit was SVN r12509.	2006-11-09 10:16:45 +00:00
Ralph Castain	ea77beca29	cleanup the TM modules in prep for T-bird tests. The TM RAS will now report time required to resolve hostnames This commit was SVN r12449.	2006-11-06 20:56:18 +00:00
Ralph Castain	884caeb2c7	Add timing tests for the TM ras This commit was SVN r12445.	2006-11-06 18:41:22 +00:00
Galen Shipman	68d9922f44	enable/disable connection sleep in oob_tcp.c via mca param.. on by default.. This commit was SVN r12444.	2006-11-06 18:00:46 +00:00
Ralph Castain	d182ae7472	Clean up a few compiler warnings courtesy of Jeff This commit was SVN r12430.	2006-11-03 20:45:22 +00:00
Ralph Castain	c77f6c605e	Update timing reports: 1. Remove timing of xcast from mpi_init 2. Add timing report from oob_xcast on how long it took to send the message This commit was SVN r12428.	2006-11-03 18:55:05 +00:00
Ralph Castain	761c8beeb7	Add output of the compound command size to the timing reports This commit was SVN r12423.	2006-11-03 16:11:37 +00:00
Ralph Castain	60e27c77e7	Add some additional timing reporting: 1. Added reporting points around the xcasts in MPI_Init. Note that these times will include time spent waiting for a trigger to fire, which is why the times between stage gates did NOT include these times initially. The inter-stage-gate times still do NOT include the xcast time - the xcast time is reported separately. 2. Added the process vpid on the MPI_Init timing reports for clarity. 3. Added a report from the xcast function on the HNP that outputs the number of bytes in the message being sent to the processes. This commit was SVN r12422.	2006-11-03 16:04:40 +00:00
Ralph Castain	2472f76d89	Make sure the print_map doesn't get called if we don't do the map step in rmgr.spawn This commit was SVN r12404.	2006-11-02 05:52:24 +00:00
Ralph Castain	dc4de57253	Per request from the Eclipse team, change the attributes that control the "flow" through the rmgr.spawn procedure from "stop after this step" to "do this step". This allows them to use the spawn procedure to complete the launch one step at a time so they can let the user see the results after each step. Shouldn't affect anyone else. This commit was SVN r12403.	2006-11-02 05:15:24 +00:00
Ralph Castain	233dac8bba	Enable the new attributes for remote operations such as comm_spawn This commit was SVN r12382.	2006-10-31 23:52:25 +00:00
Ralph Castain	30de73a712	Add a few attributes that are helpful for folks doing things like Eclipse. Also add yet another command-line option to orterun to support one of the new attributes. These include: 1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch" 2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command. 3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job 4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map. Enjoy! My personal favorite is the first one - helps with debugging. This commit was SVN r12379.	2006-10-31 22:16:51 +00:00
Ralph Castain	17c71f8d2a	Fix ticket #545 Setup subscriptions to correctly return the MPI_APPNUM attribute. Fix an unreported bug that was found. The universe size was incorrectly defined in the attributes code. As coded, it looked for size_t values and based its size computation on those numbers. Unfortunately, the node_slots value had been changed to an orte_std_cntr_t awhile back! So the universe size was never updated. Update the hello_nodename test to check for MPI_APPNUM. Add a definition to ns_types for ORTE_PROC_MY_NAME - just a shortcut for orte_process_info.my_name. Brought over from ORTE 2.0 as it will be used extensively there. This commit was SVN r12377.	2006-10-31 21:29:07 +00:00
Tim Prins	894b220fbb	Fixes trac:487 Give a more intelligible error message when someone passes -nolocal and the only available node is the local node. This commit was SVN r12325. The following Trac tickets were found above: Ticket 487 --> https://svn.open-mpi.org/trac/ompi/ticket/487	2006-10-26 21:46:18 +00:00
Ralph Castain	36d4511143	Bring the timing instrumentation to the trunk. If you want to look at our launch and MPI process startup times, you can do so with two MCA params: OMPI_MCA_orte_timing: set it to anything non-zero and you will get the launch time for different steps in the job launch procedure. The degree of detail depends on the launch environment. rsh will provide you with the average, min, and max launch time for the daemons. SLURM block launches the daemon, so you only get the time to launch the daemons and the total time to launch the job. Ditto for bproc. TM looks more like rsh. Only those four environments are currently supported - anyone interested in extending this capability to other environs is welcome to do so. In all cases, you also get the time to setup the job for launch. OMPI_MCA_ompi_timing: set it to anything non-zero and you will get the time for mpi_init to reach the compound registry command, the time to execute that command, the time to go from our stage1 barrier to the stage2 barrier, and the time to go from the stage2 barrier to the end of mpi_init. This will be output for each process, so you'll have to compile any statistics on your own. Note: if someone develops a nice parser to do so, it would be really appreciated if you could/would share! This commit was SVN r12302.	2006-10-25 15:27:47 +00:00
Brian Barrett	d6ff14ed61	Hand-pack the connection information for each peer rather than just packing a sockaddr_in, as there are some endianness and padding issues with sending a sockaddr_in. Note that the sin_port and sin_addr are already in network byte order, which is why we pack them as a byte string. Refs trac:493 This commit was SVN r12301. The following Trac tickets were found above: Ticket 493 --> https://svn.open-mpi.org/trac/ompi/ticket/493	2006-10-25 15:09:30 +00:00
Ralph Castain	601443e690	Fix the other end of the attribute store/retrieve to correctly handle attributes with no value This commit was SVN r12290.	2006-10-25 01:41:58 +00:00
Ralph Castain	46166a9c77	Fix a bug that would cause a segfault when someone specified an option that had no corresponding value This commit was SVN r12288.	2006-10-24 22:06:18 +00:00
Ralph Castain	a7acd22e47	Fix a fix so we see the correct MCA param This commit was SVN r12282.	2006-10-24 17:24:09 +00:00
George Bosilca	ed3caa9fc7	mca_base_param_reg_int requires a pointer to a component, not a name. This commit was SVN r12281.	2006-10-24 16:41:51 +00:00
George Bosilca	7dec7731ce	Instead of size_t use orte_std_cntr_t. Remove all warnings. This commit was SVN r12280.	2006-10-24 16:40:49 +00:00
Ralph Castain	8636ac6a4d	Fix ticket 353 - print out a nice message that the combination of debug-daemons and num_concurrent in the pls rsh launcher will cause deadlock and exit This commit was SVN r12279.	2006-10-24 15:59:02 +00:00
Tim Prins	cb622db7c9	Fixes trac:352 Only close off stdout/stderr from the daemons if we are not debugging the slurm pls and --debug-daemons was not passed. This commit was SVN r12276. The following Trac tickets were found above: Ticket 352 --> https://svn.open-mpi.org/trac/ompi/ticket/352	2006-10-24 13:05:13 +00:00
Tim Prins	93d61d01fb	Fix for a problem on SLURM we have been having since r12243 where mpirun would hang after the process had finished. It turns out that we were always reporting the name of the daemon wrong, but we simply never noticed as we never used it, until r12243. This makes it so we report the name of the daemon correctly. This commit was SVN r12274. The following SVN revision numbers were found above: r12243 --> open-mpi/ompi@153e38ffc9	2006-10-24 01:41:28 +00:00
Ralph Castain	7a77ef0ae3	Given the amount of pain singletons cause, one can't help but wonder if it REALLY was that much trouble for people to type "mpirun -n 1 foo"....sigh. Get the ordering right so that a singleton can start. Protect the rmgr copy app_context function from NULL fields Tell the mapper it is okay for there not to be a pre-existing mapping plan for a parent when dynamically spawning processes This commit was SVN r12257.	2006-10-23 15:15:45 +00:00
George Bosilca	2a863df0a5	Newline is required by some compilers at the end of a file. This commit was SVN r12244.	2006-10-21 05:56:04 +00:00
Ralph Castain	153e38ffc9	Lesson to be learned: if you send an ack to a recv'd command, better not send it to the same tag it came from - at least, not if there is a persistent recv on that tag! Fix the persistent daemon problem where it was exiting when a job completed. Problem was that the persistent daemon would order the job daemons to exit. They would then send an 'ack' back to the persistent daemon - but the ack consisted of an echo of the "exit" command, which was recv'd by the wrong listener who treated it as a properly sent cmd....and exited. This commit was SVN r12243.	2006-10-21 02:53:19 +00:00
Ralph Castain	ab7bbb80a5	Teach the mapper to correctly handle the unbalanced --host scenario. We now map in a more expected fashion. This commit was SVN r12240.	2006-10-20 20:48:24 +00:00
George Bosilca	548b94e4e1	Add a missing ORTE_DECLSPEC. This commit was SVN r12236.	2006-10-20 19:37:21 +00:00
Tim Prins	28bf4d85ab	A couple of small fixes: - It is possible to leave a byslot/bynode routine and have cur_node_item be NULL, so check for that. - After we do an allocation where the user has provided a map (i.e. with --host), cur_node_item is pointing into the map list, not the global list. Change it to point into the global list. This commit was SVN r12232.	2006-10-20 19:00:17 +00:00
Ralph Castain	955d11fa7b	The bookmark now respects slot assignments a little better. It will not oversubscribe the first node, but will take only what is available there before moving on. See the comment in orte/mca/rmaps/round_robin/rmaps_rr.c if you want the details... :-) This commit was SVN r12230.	2006-10-20 18:24:14 +00:00
Ralph Castain	ec0bb9ffda	Fix the bookmark system - we now have children being correctly spawned where they should! Also, I am no longer seeing any issue with the child job spawning its own daemons - this appears to be fixed. We still don't reuse the existing daemons, however, but that will come. This commit was SVN r12229.	2006-10-20 18:05:16 +00:00
Ralph Castain	c07d4e2510	Cleaner rendition now extended to other environments. Remove MCA params for backend procs that can cause trouble. Specifically, any directives on the selection of components for RDS, RAS, RMAPS, PLS, and RMGR can be bad mojo on the backend. This patch will cause a problem for cnos, however, as there we want to specifically tell the backends to be "null". I'm working on that issue. This commit was SVN r12225.	2006-10-20 16:50:13 +00:00
Ralph Castain	02efd07b60	Fix the MCA param passing issue, at least for rsh at the moment. I will clean this up and move it to the other environments once I shift back to a local computer. This commit was SVN r12224.	2006-10-20 15:27:29 +00:00
Ralph Castain	b07a6b1d7a	Fix a major typo that caused remote launch to crash - had something inside the wrong brace This commit was SVN r12221.	2006-10-20 14:30:23 +00:00
George Bosilca	2aa3e51223	Nothing relevant. Only a set of castings to have a clean compile on Windows. The cl.exe compiler is pretty good at complaining about any kind of non explicit cast. This commit was SVN r12207.	2006-10-20 02:25:50 +00:00
George Bosilca	7982a23bde	ORTE_DECLSPEC should be ... This commit was SVN r12206.	2006-10-20 02:23:54 +00:00
George Bosilca	5a939e21b2	Populate the file with ORTE_DECLSPEC declarations. This commit was SVN r12205.	2006-10-20 02:19:30 +00:00
Tim Prins	ade94b523b	Fixed a number of issues related to resource allocation: - Simplified the logic of the ras modules by moving the attribute handling into the base allocation function. This allows us to decide how to allocate based on the situation, and solves some of the allocation problems we were having with comm_spawn. - moved the proxy component into the base. This was done because we always want to call the proxy functions if we are not on a HNP regardless of the attributes passed. - Got rid of the hostfile component. What little logic was in it was moved into the base to deal with other circumstances. The hostfile information is currently being propagated into the registry by the RDS, so we just use what is already in the registry. - renamed some slurm functions so that they have the proper prefix. Not strictly necessary as they were static, but it makes debugging much easier. - fixed a buglet in the round_robin rmaps where we would return an error when really no error occurred. I tried to make proper corrections to all the ras modules, but I cannot test all of them. This commit was SVN r12202.	2006-10-19 23:33:51 +00:00
Ralph Castain	ab196c3121	Okay, this fixes the problem of MCA params spreading too far. Sorry for the multiple corrections. This commit was SVN r12201.	2006-10-19 22:51:02 +00:00
George Bosilca	c788f0dd51	Apparently, the MCA params are being loaded into the environment params of the individual app_contexts as well. Clear them of "bad" ones. This commit was SVN r12198.	2006-10-19 21:19:08 +00:00
Ralph Castain	d0d3d7fd41	Apparently, the MCA params are being loaded into the environment params of the individual app_contexts as well. Clear them of "bad" ones. This commit was SVN r12197.	2006-10-19 20:49:35 +00:00
Ralph Castain	263f4379e8	Clean up an error in the mapper that caused "-hosts" to bomb. Update the mapper so it correctly points to the next node to be used if we are mapping by slots. As it was, if we had an app_context that used only one slot on a node, the next app_context would start on the next node - leaving a blank slot in-between. This commit was SVN r12193.	2006-10-19 18:57:29 +00:00
Tim Prins	81d400ddfd	break when they are equal, not not equal This commit was SVN r12182.	2006-10-18 21:47:01 +00:00
Tim Prins	ab964d096a	Need a terminating NULL... This commit was SVN r12180.	2006-10-18 20:52:31 +00:00
Ralph Castain	d0eb7d7216	Complete the attribute management functions. Modify the mapper to better bookmark its stopping place each time, and to pick up the next time from there. This needs to be validated on a multi-node system. Fix a major memory corruption problem in the registry put/get functions that was doing multiple free's. Not sure how valgrind missed this one, though it only occurred in specific circumstances (such as comm_spawn). This commit was SVN r12179.	2006-10-18 20:02:16 +00:00
Ralph Castain	f4a458532b	This doesn't totally resolve the comm_spawn problem, but it helps a little. I'll continue working on it and hope to resolve it completely shortly. The issue primarily centers on where to start mapping the child job's processes, and how to deal with oversubscription that might result. At the moment, I am trying to resolve the first issue first (hey, that even sounds right!). This change does a couple of things: 1. Since the USE_PARENT_ALLOC attribute is a directive regarding allocation of resources to a job, it more properly should be an attribute of the RAS. Change the name to reflect that and move the attribute define to the ras_types.h file. 2. Add the attributes list to the RMAPS map_job interface. This provides us with the desired flexibility to dynamically specify directives for mapping. The system will - in the absence of any attribute-based directive - default to the values provided in the MCA parameters (either from environment or command-line interface). This commit was SVN r12164.	2006-10-18 14:01:44 +00:00
George Bosilca	1d69375dd7	Somehow, somewhere we have to initialize rc ... This commit was SVN r12156.	2006-10-17 22:32:21 +00:00
Ralph Castain	0c0fe022ff	This is a first cut at fixing the problem of comm_spawn children being mapped onto the same nodes as their parents. I am not convinced the behavior implemented here is the long-term right one, but hopefully it will help alleviate the situation for now. In this implementation, we begin mapping on the first node that has at least one slot available as measured by the slots_inuse versus the soft limit. If none of the nodes meet that criterion, we just start at the beginning of the node list since we are oversubscribed anyway. Note that we ignore this logic if the user specifies a mapping - then it's just "user beware". The real root cause of the problem is that we don't adjust sched_yield as we add processes onto a node. Hence, the node becomes oversubscribed and performance goes into the toilet. What we REALLY need to do to solve the problem is: (a) modify the PLS components so they reuse the existing daemons, (b) create a way to tell a running process to adjust its sched_yield, and (c) modify the ODLS components to update the sched_yield on a process per the new method Until we do that, we will continue to have this problem - all this fix (and any subsequent one that focuses solely on the mapper) does is hopefully make it happen less often. This commit was SVN r12145.	2006-10-17 19:35:00 +00:00
Tim Prins	720eb88cad	Make no-op function match new interface. This commit was SVN r12142.	2006-10-17 17:34:06 +00:00
Tim Prins	8b0170148e	Add some missing headers. This commit was SVN r12141.	2006-10-17 17:28:02 +00:00
Tim Prins	5d31332f97	Goodbye poe, long live LoadLeveler... This commit was SVN r12140.	2006-10-17 17:07:48 +00:00
Ralph Castain	13227e36ab	This commit looks a lot bigger than it is, so relax :-) Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off. To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place. I used those capabilities in two places: 1. Added an attribute list to the rmgr.spawn interface. 2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h). So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms. This commit was SVN r12138.	2006-10-17 16:06:17 +00:00
Rainer Keller	3f88937081	- Error logging is really not yet enabled. - Correct the error log for orte_errmgr_base_select - Spelling fixes This commit was SVN r12135.	2006-10-17 09:11:20 +00:00
Ralph Castain	16e52c5784	Fix a non-compliance issue regarding hostfiles as reported by Sun. The man page states that an entry that specifies slots_max but does not specify "slots" will have the soft limit default to the hard limit. The hostfile implementation, however, defaulted the soft limit to 1. This fix changes that behavior to conform to the man page. This commit was SVN r12129.	2006-10-17 00:43:12 +00:00
Ralph Castain	3f55d6897a	Remove the memory debugging options. Fix what appears to be a typo in a help file. This commit was SVN r12107.	2006-10-12 00:44:48 +00:00
Brian Barrett	fce5130333	Delay opening the listen socket until module init, so that we can have the seed value have something set to true. Allow selection of the listen type to thread if (and only if) the process is the HNP... This commit was SVN r12105.	2006-10-11 21:29:29 +00:00
Brian Barrett	29c91cf2f3	* Fix issue in odls_bproc where we were using vpid instead of the number of processes launched locally for the stdio file names. This was causing the expected files to not exist and bproc_vexecmove_io to fail. * Clean up a bunch of debugging output in the bproc pls This commit was SVN r12102.	2006-10-11 20:34:12 +00:00
Ralph Castain	f91a95b3fe	Fix the bug that caused mpirun to hang when a remote executable wasn't found using the rsh launcher. Will now test on a remote node This commit was SVN r12095.	2006-10-11 18:43:13 +00:00
Ralph Castain	2da8245be0	Correctly propagate no-daemonize This commit was SVN r12093.	2006-10-11 17:53:17 +00:00
Ralph Castain	e5877cc459	Add the proper valgrind params This commit was SVN r12092.	2006-10-11 17:48:41 +00:00
George Bosilca	b56636c855	orte_pls belongs to the PLS framework, therefore it should only be defined in pls.h. This commit was SVN r12089.	2006-10-11 17:12:22 +00:00
Ralph Castain	27e305347c	Add a couple of options to orterun that support debugging of daemons for memory corruption. Ensure that the environment provided to local application processes isn't "polluted" by the orteds This commit was SVN r12087.	2006-10-11 15:18:57 +00:00
Brian Barrett	f5b8f1f2f0	Work around Automake not knowing how to properly configure libtool to build Objective C libraries Refs trac:483 This commit was SVN r12080. The following Trac tickets were found above: Ticket 483 --> https://svn.open-mpi.org/trac/ompi/ticket/483	2006-10-10 20:14:26 +00:00
Ralph Castain	699ffcf359	Restore the "bynode" mapping functionality - accidentally deleted setting of parameter This commit was SVN r12078.	2006-10-10 19:41:22 +00:00
George Bosilca	7dadc1832d	Correctly export the required functions. They are defined in a private file, but they are completely public. This commit was SVN r12070.	2006-10-10 04:54:51 +00:00
Ralph Castain	0e9dc590b7	Fix typo that didn't make it over from testing on vogon This commit was SVN r12068.	2006-10-09 20:37:39 +00:00
Ralph Castain	cebdc51762	Remove a debugging output This commit was SVN r12066.	2006-10-09 02:10:52 +00:00
Ralph Castain	2e09128337	Many thanks to Jeff for tracking down the typo causing the orte_job_map_t destructor to fail!! Restore the OBJ_RELEASE calls to cleanup map objects. This commit was SVN r12064.	2006-10-07 22:44:00 +00:00
Ralph Castain	98dd57b70e	Add a new option to launch "pernode" - launches one process/node across all available nodes. The other options also work correctly: "-bynode" with no -np will launch on all slots, mapped on a per-node basis. This commit was SVN r12063.	2006-10-07 19:50:12 +00:00
Jeff Squyres	efe28d62e9	Fix some compiler errors. I have not checked this for correctness; but it does now compile. This commit was SVN r12062.	2006-10-07 19:10:56 +00:00
Ralph Castain	ae79894bad	Bring the map fixes into the main trunk. This should fix several problems, including the multiple app_context issue. I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there). Gridengine compiles but I cannot test (believe it likely will run). Poe and xgrid compile to the extent they can without the proper include files. This commit was SVN r12059.	2006-10-07 15:45:24 +00:00
Ralph Castain	82a023c731	Fix a typo that caused a segfault if a caller requested that we abort an array of procs (i.e., MPI_Abort when it specifies the other procs to be aborted). Should hopefully address the recent problem seen with the BLACS AUX test as discussed on the user mailing list. This commit was SVN r12055.	2006-10-07 01:58:11 +00:00
Ralph Castain	ee0df85ece	Add a stupid, useless const since someone put it in the odls.h file. This commit was SVN r12054.	2006-10-07 01:42:23 +00:00
George Bosilca	cda46efd2a	Some missing DECLSPEC This commit was SVN r12047.	2006-10-06 15:21:52 +00:00
George Bosilca	017af37291	Keep only the useful _DECLSPEC and _DECLSPEC some globals. This commit was SVN r12042.	2006-10-06 07:24:34 +00:00
George Bosilca	b7579b09c7	Correctly handle the ORTE_DECLSPEC attribute. This commit was SVN r12041.	2006-10-06 07:08:17 +00:00
George Bosilca	422ce1d3f8	If we are in the ODLS framework then the types should be called odls not pls. This commit was SVN r12040.	2006-10-06 07:05:58 +00:00
George Bosilca	7fed79434e	Windows is now able to create local processes. This commit was SVN r12039.	2006-10-06 07:04:43 +00:00
George Bosilca	fa01b9b9aa	Last step for the name reversion (from windows back to process). This commit was SVN r12014.	2006-10-05 06:36:11 +00:00
George Bosilca	b7a793a6db	Rename all the files in the process directory. This commit was SVN r12013.	2006-10-05 06:34:30 +00:00
George Bosilca	fee0909815	Rename the Windows component. This commit was SVN r12012.	2006-10-05 06:32:57 +00:00
George Bosilca	c79c436c8d	Cleanups. Remove all __WINDOWS__ checks as this module will never get compiled on Windows. This commit was SVN r12011.	2006-10-05 06:17:30 +00:00
George Bosilca	dbe7f8ac32	Always return bool. This commit was SVN r12009.	2006-10-05 05:45:18 +00:00
George Bosilca	d1e884fbf5	Make sure we always return a bool (true/false). This commit was SVN r12002.	2006-10-05 05:27:46 +00:00
George Bosilca	3a34f9340e	If the enum is defined inside the struct it will have a scope. We don't really need that. This commit was SVN r12001.	2006-10-05 05:27:04 +00:00
George Bosilca	090b8a9098	opal_list_is_empty returns true or false ... This commit was SVN r12000.	2006-10-05 05:26:08 +00:00
George Bosilca	03083cc1f6	Don't release the values[0] before it gets initialized. This commit was SVN r11999.	2006-10-05 05:25:18 +00:00
George Bosilca	fd76e56279	One protection against C++ compilers is more than enough. This commit was SVN r11998.	2006-10-05 05:24:43 +00:00
George Bosilca	ad5810e33f	ORTE_DECLSPEC what needs to be ORTE_DECLSPEC. This commit was SVN r11997.	2006-10-05 05:22:22 +00:00
Ralph Castain	faf3a558e6	Missing CR at end of file This commit was SVN r11959.	2006-10-03 18:17:52 +00:00
Ralph Castain	cd7d87aa7b	Define the map data types for dss compatibility. Setup to debug bproc This commit was SVN r11955.	2006-10-03 17:40:00 +00:00
Ralph Castain	4e39878944	Add a "dump" capability to the DSS so one can display a single data value to an output stream. Add some comments to the map type def in prep for building its data type support. This commit was SVN r11947.	2006-10-03 08:40:35 +00:00
Ralph Castain	99f2986db7	Bring comm_spawn back online. Shift the trigger hosting responsibilities to the HNP. We still have an issue with the io forwarding going through the spawning process, but that will be dealt with at a future time. This commit was SVN r11943.	2006-10-03 02:07:58 +00:00
Ralph Castain	b269e4da9b	Add missing functionality to the ns replica to support remote get_job_peers requests. Add trace commands to help try and track down remaining problem with comm_spawn. This commit was SVN r11939.	2006-10-02 19:44:35 +00:00
Ralph Castain	9eb14425b7	The last of the debug messages that keep hiding. My apologies. This commit was SVN r11937.	2006-10-02 18:43:32 +00:00
Ralph Castain	3fd67a038f	Bring comm_spawn and persistent operations online. Still some minor problems, though - so don't use them yet, please. This won't affect anything except those two scenarios. This commit was SVN r11936.	2006-10-02 18:29:15 +00:00
Ralph Castain	12328395ae	Missed a couple of debug statements This commit was SVN r11935.	2006-10-02 15:46:41 +00:00
Ralph Castain	7494a7a83f	Clean out some debugging statements that were inadvertently left in the commit This commit was SVN r11933.	2006-10-02 15:03:18 +00:00
Ralph Castain	559b9b0ae8	Continue beating on comm_spawn. Setup to debug bproc. This commit was SVN r11932.	2006-10-02 14:58:22 +00:00
Ralph Castain	65593cd67e	Fix a few Cyrador warnings This commit was SVN r11930.	2006-10-02 13:00:32 +00:00
Brian Barrett	8f7ab1c584	num_procs can be zero if something went partly wrong before. This will cause a math exception on some platforms, so don't let that happen. This commit was SVN r11929.	2006-10-02 01:27:22 +00:00
Ralph Castain	121f834776	Continue bringing comm_spawn back online. Ensure all RM frameworks post their HNP receives. Fix the rmgr proxy component. Still need some work on the proxy component, and on job termination for persistent daemon case. This commit was SVN r11928.	2006-10-02 00:46:31 +00:00
Brian Barrett	e464adcd51	Need to tell the daemon how many procs it will start (always 1, because of the way we do the fake mapping thing... This commit was SVN r11924.	2006-10-01 23:25:22 +00:00
Brian Barrett	95ba51fbd4	* Clean up debugging output so that it's useful * Error message in an NSError object is localizedDescription, not localizedErrorReason. The latter is a description of how the error can occur, which is usually nothing in XGrid frameworks. * Clean up silly error in finding the Kerberos Service Principal when using Kerberos authentication * Print useful error message when a connection unexpectedly closes, as this is usually authentication related... This commit was SVN r11923.	2006-10-01 22:43:17 +00:00
Brian Barrett	9807a38458	Always initialize the base output stream, but only set verbose if requested. Otherwise, the PLS components have to pay more attention to debugging streams than the rest of the OMPI source code This commit was SVN r11922.	2006-10-01 22:37:30 +00:00
Ralph Castain	33a46cff40	Grrrr....ok, this time actually put a value in the silly command buffer! This commit was SVN r11901.	2006-09-29 21:44:11 +00:00
Ralph Castain	0625ead54a	Fix comm_spawn communications This commit was SVN r11900.	2006-09-29 21:10:48 +00:00
Ralph Castain	0411f9772e	Begin instrumenting for scalability tests. I have added a new MCA param (hey, you can't have too many!) called OMPI_MCA_orte_timing. If set to anything other than zero, the system will report out critical timing loops. At the moment, this includes three measurements: 1. Time spent going through the RDS->RAS->RMAPS, setting up triggers, etc. prior to calling the actual PLS launch function. This is reported out as time to setup job. 2. Time spent in MPI_Init from start of that function (well, right after opal_init) to the place where we send all of our info the registry. Reported out as time from start to exec_compound_cmd 3. Time actually spent executing the compound cmd. Reported out as time to exec_compound_cmd. A few additional timing points will be added shortly. These may eventually be removed or (better) setup with a conditional compile flag. This commit was SVN r11892.	2006-09-29 13:19:44 +00:00
Ralph Castain	db6a93fa63	Fix a couple of reported issues: 1. PLS finalize was not being called. Now ensure that happens during orte_finalize. 2. Errmgr proxies were sending their messages to the wrong tag - typical cut/paste error. This commit was SVN r11891.	2006-09-29 12:45:50 +00:00
Tim Prins	1b35e7adff	cleanup This commit was SVN r11863.	2006-09-28 13:28:48 +00:00
Brian Barrett	d00a0de716	* It appears that in their infinite wisdom, Apple removed the __DARWIN_ALIGN_POWER define from the last release of the OS X compiler toolchain. The bug in net/if.h, however, is still there. So look for the hints that we're on a 64 bit Apple PowerPC instead. * If we don't find a buffer size that works by 10MB, we're never going to. So add some code to limit the buffer size we'll try so that we don't fall into an infinite loop * Detect errors in opal_ifcount in the oob init code Refs trac:420 This commit was SVN r11825. The following Trac tickets were found above: Ticket 420 --> https://svn.open-mpi.org/trac/ompi/ticket/420	2006-09-26 16:37:04 +00:00
Brian Barrett	8943f583bf	quiet some debugging output This commit was SVN r11813.	2006-09-26 04:10:07 +00:00
Brian Barrett	d8d55a760f	First attempt at Kerberos support for the XGrid process starter refs trac:345 This commit was SVN r11812. The following Trac tickets were found above: Ticket 345 --> https://svn.open-mpi.org/trac/ompi/ticket/345	2006-09-26 03:54:38 +00:00
Brian Barrett	9733c8e3bd	Update XGrid RAS and PLS to the new infrastructure. Not yet super well tested, but starting to get there... This commit was SVN r11810.	2006-09-26 03:26:45 +00:00
Brian Barrett	3c814fdd23	fixes trac:391 Fix for double mutex free that would cause an abort condition in the orted whenever threads were enabled. This commit was SVN r11759. The following Trac tickets were found above: Ticket 391 --> https://svn.open-mpi.org/trac/ompi/ticket/391	2006-09-22 19:24:42 +00:00
Andrew Friedley	798c19d395	Blah.. we should always return after try_connect() here, not just when we have an error. Another fix for ticket #362. This commit was SVN r11756.	2006-09-22 15:51:11 +00:00
Tim Prins	567676f3c1	- Formatting and minor cleanup - made it so we now set the architecture of each node we discover - remove debugging output This commit was SVN r11751.	2006-09-22 13:24:32 +00:00
Tim Prins	83a7f6e4de	Fix for bug #369 . LoadLeveler only sets LOADL_PROCESSOR_LIST when there are 128 or fewer tasks allocated to a job. The POE RAS relied on this variable so I created a new RAS which uses the LoadLeveler API instead of relying on the environment variable. This still needs some testing, so for now we use the POE RAS whenever LOADL_PROCESSOR_LIST is set, otherwise we fall back on this component. Unfortunately, this will require an autogen... This commit was SVN r11732.	2006-09-21 00:08:49 +00:00
Andrew Friedley	8895bf7369	Fix the fix (r11718) for bug #362 . We were still waiting the entire duration of the timeout before we figured out that a connect() was successful. Re-introduce adding the peer_send_event so that we detect immediately when a connect() completes. Also make sure to delete the timeout event in complete_connect(). Fixed a struct timeval initialization warning reported by Jeff. Remove an erroneous opal_output(). This commit was SVN r11724. The following SVN revision numbers were found above: r11718 --> open-mpi/ompi@1b6231a9b5	2006-09-20 14:29:37 +00:00
Andrew Friedley	1b6231a9b5	Fix for running jobs that span multiple 's' partitions on IU BigRed. Each 's' partition has its own TCP network. It's fine to use this network for jobs that fit inside the partition, but the TCP OOB errors when trying to connect across two partitions, because there are two disjoint networks. Each node also has another TCP network connecting ALL nodes together. So the solution is to actually try all the available TCP interfaces on a node, instead of erroring when the first one fails. Also, the default TCP connect() timeout is way too long (5 minutes) - use our own timeout mechanism, with the timeout value expressed as an MCA parameter. This commit was SVN r11718.	2006-09-19 19:33:49 +00:00
Tim Prins	c4db5654fa	Fix for bug #370 The POE ras did not correctly enter the number of slots per node. This fixes that. This commit was SVN r11716.	2006-09-19 16:27:15 +00:00
Ralph Castain	977e3c5ca1	Let's see if Cyrador understands this version a little better... This commit was SVN r11709.	2006-09-19 13:05:40 +00:00
Ralph Castain	0ad0d84afd	Add two new API functions to the RMGR, and modify the "spawn" API to support the enhanced MPI-2 functionality. No implementation backs these new APIs - just placeholders for now. This commit was SVN r11699.	2006-09-19 01:45:05 +00:00
Ralph Castain	d7e61e40fc	Quiet a few warnings from Cyrador This commit was SVN r11686.	2006-09-18 12:40:42 +00:00
Ralph Castain	8a291afda6	Ensure the rds_private.h file gets included in the distribution This commit was SVN r11682.	2006-09-16 11:45:02 +00:00
Ralph Castain	f906af983a	Forgot to change the silly Makefile.am names - sorry Cyrador! This commit was SVN r11670.	2006-09-15 04:52:20 +00:00
Jeff Squyres	3e239f4532	Add a missing .ompi_ignore This commit was SVN r11666.	2006-09-15 02:36:22 +00:00
George Bosilca	4fe39a4e7d	The old PLS is now called an ODLS. However, the real name is not windows but process. This change will follow shortly... This commit was SVN r11663.	2006-09-14 22:22:34 +00:00
Ralph Castain	37dfdb76eb	Here is the major MAD-cure commit. I have written plenty about it, so I refer you here to those messages for a description of everything that was done. This commit was SVN r11661.	2006-09-14 21:29:51 +00:00
George Bosilca	17afe7dc9f	Do it the correct way, as this is normally compiled as a module. This commit was SVN r11660.	2006-09-14 21:22:41 +00:00
George Bosilca	01c5a115b2	Don't export the POE module. Only the component has to be exported (visible). This commit was SVN r11659.	2006-09-14 21:20:31 +00:00
Josh Hursey	908f31fe9f	Fix a code clarity issue in the POE PLS. Allow the POE RAS to be compiled for linux as well as AIX. The POE RAS is really a Loadleveler RAS, and IU now has a cluster that uses Loadleveler in a Linux environment (BigRed). This seems to be the only thing we need to do so far to run Open MPI on BigRed. Yay :) This commit was SVN r11600.	2006-09-09 05:13:15 +00:00
Josh Hursey	160120b4c5	Fix a cut-n-paste error that causes the 'num_concurrent' to be set to 1 or 0 instead of the user defined number or default (128). This caused the PLS to deadlock when using '--debug-daemons' with more than 2 processes. :( svn blame says that it was broken in r11347 It is not a problem on v1.1 or v1.2 branches. Bug spotted by Tim Mattox and myself. This commit was SVN r11575. The following SVN revision numbers were found above: r11347 --> open-mpi/ompi@f52c10d18e	2006-09-08 15:17:17 +00:00
Jeff Squyres	0f11584a6c	* Update svn:ignore * Remove svn:executable from non-executable files This commit was SVN r11555.	2006-09-07 17:17:40 +00:00
Ralph Castain	9e6e9b8619	Fix a couple of variable declarations This commit was SVN r11467.	2006-08-28 13:28:10 +00:00
George Bosilca	c2311f6e42	Don't define the yywrap function. This commit was SVN r11459.	2006-08-28 04:11:25 +00:00
George Bosilca	693c835137	No need to cast as the returned value is already in the expected type. This commit was SVN r11458.	2006-08-28 04:10:43 +00:00
George Bosilca	ba1514f2e7	A slightly more Windows friendly version. Unfortunately there is no support for SGE on Windows. This commit was SVN r11436.	2006-08-27 04:46:43 +00:00
Pak Lui	131f0eff04	fix the verbose value. This commit was SVN r11418.	2006-08-24 21:30:08 +00:00
Pak Lui	65a524dd0d	- need to provide option for showing the grid engine's JOB_ID in case the grid engine job needs to be killed - clean up the orted_path and debug message This commit was SVN r11413.	2006-08-24 20:27:19 +00:00
Pak Lui	4f75dfd353	- missed the opal_os_path() for LD_LIBRARY_PATH This commit was SVN r11410.	2006-08-24 18:58:50 +00:00
George Bosilca	9110ea2b80	Add the Windows fork component. As fork is not available on Windows, I create a process component which use CreateProcess to spawn the child. Special care should be taken in order to correctly redirect the stdin, stdout and stderr of the child process. This commit was SVN r11405.	2006-08-24 17:51:20 +00:00
George Bosilca	0d607c1346	Use opal_os_path and OPAL_PATH_SEP to build the file path. I don't have any machine to test, so I hope I get it right. This commit was SVN r11398.	2006-08-24 16:20:32 +00:00
Pak Lui	5220c1ca42	- converted some tabs into spaces This commit was SVN r11384.	2006-08-23 23:21:08 +00:00
Pak Lui	9dda057f05	- Do the changes as in r11347 for gridengine to use opal_os_path(). - Remove extra NULL argument from rsh module. This commit was SVN r11377. The following SVN revision numbers were found above: r11347 --> open-mpi/ompi@f52c10d18e	2006-08-23 20:40:01 +00:00
Jeff Squyres	715bae369c	Remove extra argument - now obsoleted by the use of opal_os_path(). This commit was SVN r11366.	2006-08-23 14:32:06 +00:00
Brian Barrett	e39f0096a0	* add header file to sources list so make dist works This commit was SVN r11357.	2006-08-23 13:31:56 +00:00
George Bosilca	c03ef692c1	And the missing header. This commit was SVN r11348.	2006-08-23 03:33:35 +00:00
George Bosilca	f52c10d18e	And ORTE is ready for prime-time. All Windows tricks are in: - use the OPAL functions for PATH and environment variables - make all headers C++ friendly - no unnamed structures - no implicit cast. Plus a full implementation for the orte_wait functions. This commit was SVN r11347.	2006-08-23 03:32:36 +00:00
George Bosilca	aecdfc80eb	Don't forget to release the object if we detect an error. This commit was SVN r11346.	2006-08-23 02:43:05 +00:00
Ralph Castain	c3ba1c1cc1	Fix a pack/unpack mismatch This commit was SVN r11315.	2006-08-22 13:50:59 +00:00
Ralph Castain	73a7916946	For Ollie...fix a few names. Should help the Bproc SMR component compile. This commit was SVN r11284.	2006-08-21 15:11:20 +00:00
George Bosilca	6afa4c6c64	Windows friendly version. We have to split the OMPI_DECLSPEC in at least 3 different macros, one for each project. Therefore, now we have OPAL_DECLSPEC, ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project. This commit was SVN r11270.	2006-08-20 15:54:04 +00:00
Ralph Castain	ee04e04dd0	Attempt to cleanup the xgrid pls module This commit was SVN r11261.	2006-08-18 21:21:31 +00:00
Ralph Castain	6bf06d4602	Fix connect-accept by cleaning up two minor bugs. This commit was SVN r11260.	2006-08-18 21:12:03 +00:00
Ralph Castain	517d6fda49	Add the smr_private include file so it gets put in tarballs This commit was SVN r11243.	2006-08-17 12:24:44 +00:00
Ralph Castain	8c7f0ed9ae	Change the SOH to the new State Monitoring and Reporting (SMR) framework. New API's will be appearing in the new framework shortly - this just gets the name change into the system. Other changes: 1. Remove the old xcpu components as they are not functional. 2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one. This will require an autogen/configure, I'm afraid. This commit was SVN r11228.	2006-08-16 16:35:09 +00:00
Ralph Castain	5dfd54c778	With the branch to 1.2 made.... Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced). Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up). I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t). In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but... Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems. This commit was SVN r11204.	2006-08-15 19:54:10 +00:00
Brian Barrett	cd7b138d74	Propagate up errors when setting up standard input forwarding This commit was SVN r11187.	2006-08-14 21:09:05 +00:00
Ralph Castain	d2912f03e0	Cleanup a historical naming convention problem. Move the socket_errno definitions to the OPAL layer and change the name accordingly. This cleans up some interrelationship issues as well as removing a name confusion. This commit was SVN r11186.	2006-08-14 20:14:44 +00:00
Ralph Castain	663e25f7cb	Finalize the Bproc vpid algorithm. Bproc is now fully operational and supports oversubscribed conditions for both bynode and byslot mapping procedures. This commit was SVN r11180.	2006-08-14 19:16:11 +00:00
Ralph Castain	285aea1c0c	Update to bproc algorithm to support oversubscription - committing to move to another test environment. Note that this may break bproc for the moment. This commit was SVN r11178.	2006-08-14 18:34:13 +00:00
Ralph Castain	de9156552b	I have confirmed that the later version of the bproc launcher does support Bproc 3, so it appears that the outdated bproc_seed launcher truly is no longer required. This commit was SVN r11164.	2006-08-12 07:47:21 +00:00
Ralph Castain	0ccc910485	Fix the Bproc vpid computation so that, when we map by slot, adjacent processes have vpids differing by only one. I will amend the documentation in the files shortly to explain why this was previously broken. This commit was SVN r11162.	2006-08-11 19:41:33 +00:00
Pak Lui	8fab3d5b82	* Inadvertently removed a wrong variable during the last change. This commit was SVN r11157.	2006-08-11 16:00:39 +00:00
Ralph Castain	59d6f1e2eb	Remove ompi_ignores on gridengine components as this seems resolved - thanks Pak for quick response! Fixed a few very minor compiler complaints in the pls_gridengine_module.c file. ISO C is less forgiving about where variables get declared. This commit was SVN r11156.	2006-08-11 15:32:17 +00:00
Pak Lui	99a0521e44	* Fix the issue that Ralph observed in MacOS X with an invalid header file and other warnings. This commit was SVN r11155.	2006-08-11 15:04:51 +00:00
Ralph Castain	5fd6306c2f	Add ompi_ignores until the configuration can be fixed This commit was SVN r11154.	2006-08-11 14:11:41 +00:00
Pak Lui	08352878cc	* Added in new ras and pls components to support Sun N1 Grid Engine (N1GE) 6 and its open source version as the job launchers for ORTE. This commit was SVN r11153.	2006-08-10 21:46:52 +00:00
Ralph Castain	bd937b219d	Tell xcast not to send to processes that have "aborted". One of those fixes that has been sitting on another branch for awhile...sigh. This commit was SVN r11142.	2006-08-09 18:23:43 +00:00
Ralph Castain	8496b6aff4	When a "fork" launch cannot find the executable, the system used to just return an error. This meant that the state of that process was never updated in the registry, leaving the counters at the incorrect levels. As a result, the triggers would never fire to indicate that the job had been aborted. This left orterun and other orteds/processes hanging. This fix should fix the problem. I will test it on a broader range of systems forsooth... This commit was SVN r11140.	2006-08-09 15:29:08 +00:00
Ralph Castain	ddd575d126	Ensure that the localhost gets placed on the registry with the same name as found in the system_info structure. Otherwise, we wind up with confusion in the session directory names. This commit was SVN r11139.	2006-08-09 15:26:37 +00:00
Brian Barrett	59844f2119	Galen noticed that the soh component wasn't linking against the bproc libraries. Fix that issue. This commit was SVN r11119.	2006-08-07 16:20:33 +00:00
Brian Barrett	16186978bb	- Fix some compile issues in r11109 - indent / whitespace cleanup - don't set --daemon-debug when pls debug is given, as it seems to make the daemons abort. This commit was SVN r11113. The following SVN revision numbers were found above: r11109 --> open-mpi/ompi@da7df6d257	2006-08-03 18:51:42 +00:00
Galen Shipman	da7df6d257	monitor bproc node state and terminate the job if a node in our job goes down.. This commit was SVN r11109.	2006-08-03 05:29:49 +00:00
Josh Hursey	d1e1a68645	This commit contains the necessary changes to get "mpirun a.out" working correctly with MPI_Comm_spawn. The problem with MPI_Comm_spawn was that the 'parent' process was rmgr.create'ing and then rmgr.launch'ing the children via the rmgr proxy component. The HNP saw these commands and processed them normally, but since we never went through the HNP's rmgr (urm component) spawn() logic the triggers and key/value pairs were never created. So the children were launched correctly, but since the HNP did not have any triggers setup, never triggered the xcast for the children to finish orte_init(). This fix puts the trigger and key/value pair initialization in rmgr_urm_spawn() for the 'mpirun a.out' case, and in the rmgr_base_unpack routine that deals with the creation of the job for the child as requested by the proxy component. This will allow the triggers to be registered for the proxy's request which only happens during MPI_Comm_spawn* Small change for a lot of debugging. Notice that this reverts r11037 to its previous version, and adds a newline to handle the spawn cases. This commit was SVN r11046. The following SVN revision numbers were found above: r11037 --> open-mpi/ompi@5813fb7d2a	2006-07-28 17:17:31 +00:00
Josh Hursey	5813fb7d2a	It seems that MPI_Comm_spawn{_multiple} has been broken since r10708 By reverting this file (changeset from commit r10708) to its previous version fixes the problem. This should be moved to the v1.1 branch where it is also broken. This commit was SVN r11037. The following SVN revision numbers were found above: r10708 --> open-mpi/ompi@febc143d8c	2006-07-27 21:21:10 +00:00
Brian Barrett	c744f650ba	* really didn't mean for this patch (the threaded accept() code) to come in with r10841, so revert it (and its fixes) out. Will bring back once cleaned up from the code used in the tbird experiment This commit was SVN r10991. The following SVN revision numbers were found above: r10841 --> open-mpi/ompi@dfa1221c3b	2006-07-25 22:32:01 +00:00
Jeff Squyres	c2d4dfce78	Remove unused variable This commit was SVN r10985.	2006-07-25 21:43:21 +00:00
Jeff Squyres	bdab8d744c	Send a pointer to the data, not the data itself. Otherwise, we could get a segv in some cases. This commit was SVN r10984.	2006-07-25 21:42:44 +00:00
Ralph Castain	65acc9325a	Fix a bug that crept in during the last change to support "mpirun a.out" operations. Since we now reserve a range of vpids for each app_context, we no longer need to track the rank and offset the starting vpid each time through the mapper - the name service automatically accounts for the offset when allocating the next starting vpid for the job. This should be shifted to v1.1. This commit was SVN r10916.	2006-07-20 21:06:15 +00:00
Ralph Castain	8bec270f90	Fix a bug noted by Jeff - we were no longer accurately recording in the registry that a process had been terminated when the user initiated the "kill" process (via cntrl-c). Added another system-level test function for ORTE that just spins until terminated by a ctrl-c signal. Modified orterun - added a couple of newlines to the output when abnormally terminating so the prompt always is on a new line. This commit was SVN r10866.	2006-07-18 14:42:27 +00:00
Gleb Natapov	f15fc4ef2f	include signal.h for SIGPIPE definition This commit was SVN r10863.	2006-07-18 09:07:53 +00:00
Brian Barrett	2185c059e8	* use opal_free_list_item_t as the type of items stored in an opal_free_list_t, rather than assuming it's an opal_list_item_t. This commit was SVN r10860.	2006-07-17 21:51:50 +00:00
Jeff Squyres	82161d20ca	Catch a SIGPIPE and allow it to be harmless. Register a no-op SIGPIPE handler before the write() and de-register it afterwards. Determine if the write() succeeded or failed by the return of write(). This commit was SVN r10858.	2006-07-17 21:15:56 +00:00
George Bosilca	33a7634009	Silence the compiler. This commit was SVN r10851.	2006-07-17 17:13:28 +00:00
Ralph Castain	404acc9f65	It's okay to call index prior to anything being put in the registry... This commit was SVN r10848.	2006-07-17 14:31:42 +00:00
Ralph Castain	574a6f7896	Fix a bug that caused the system to crash when asked for an index of the segment names. Such a request required passing a NULL value for the segment name, but the find_seg function didn't protect itself from that value. Thanks to James Kennedy (UCC-Ireland) for finding it. This commit was SVN r10847.	2006-07-17 13:51:07 +00:00
Brian Barrett	dfa1221c3b	* AC_CONFIG_LINKS has a minor problem in that it always uses ln -s, rather than $(LN_S). This causes problems with Windows and probably elsewhere (re: #200). So use a slightly different trick to get the right header selected for the MEMCPY and TIMER components. * Using the same trick used to solve the AC_CONFIG_LINKS problem, stop using a separate header file for direct calling in the PML and MTL. This lets me remove some icky code in ompi_mca.m4 that was more fragile than I really liked. This commit was SVN r10841.	2006-07-16 04:23:52 +00:00
Jeff Squyres	ffddfc5629	Turns out that it's a really Bad Idea(tm) to tm_spawn() and then not keep the resulting tm_event_t that is generated because the back-end TM library actually caches a bunch of stuff on it for internal processing, and doesn't let go of it until tm_poll(). tm_event_t's are similar to (but slightly different than) MPI_Requests: you can't do a million MPI_Isend()'s on a single MPI_Request -- a) you need an array of MPI_Request's to fill and b) you need to keep them around until all the requests have completed. This commit was SVN r10820.	2006-07-14 22:04:41 +00:00
Rainer Keller	50b5791969	- Release best_item - Reformat This commit was SVN r10814.	2006-07-14 19:55:14 +00:00
Ralph Castain	7b3ced80e8	Fix a bug that has been causing inconsistent behavior on a number of platforms. Will explain more on the core-devel list. Jeff: this needs to be back-patched to our supported prior releases. I'll try to verify how far back we need to go - my initial guess is probably all of them This commit was SVN r10801.	2006-07-14 14:16:20 +00:00
Ralph Castain	cef1ce19d6	Restore the "sleep" delay during startup. Since Jeff and I are going to a branch for T-bird, we have restored the trunk to its prior state to avoid any possibility of disturbing it. This commit was SVN r10774.	2006-07-12 22:18:53 +00:00
Jeff Squyres	ef8433a60b	After more discussion on the phone, it seems easier to not muck around in special components but rather go down to a /tmp branch. So removing these components and I'll branch next. This commit was SVN r10771.	2006-07-12 22:12:29 +00:00
Jeff Squyres	62c189ea1c	Fix a few blanket search/replaces This commit was SVN r10768.	2006-07-12 21:54:05 +00:00
Ralph Castain	badd3f4acb	Clean up a few lingering references to "urm". This commit was SVN r10765.	2006-07-12 21:01:21 +00:00
Jeff Squyres	36ca7497d1	Update m4 and configure files This commit was SVN r10764.	2006-07-12 20:55:39 +00:00
Ralph Castain	9102b5af3b	Remove the "sleep" delay in the oob connection procedure. This shouldn't cause any problems, especially for launches of less than 1000 processes. Please report any abnormal behavior during launch, though, as we would like to understand what (if any) impact is seen. I couldn't see any on small jobs (the modulo functions render this number down pretty low). This commit was SVN r10763.	2006-07-12 20:31:30 +00:00
Ralph Castain	a84898316c	Create new components to support Thunderbird scalability development This commit was SVN r10762.	2006-07-12 20:28:23 +00:00
Brian Barrett	4b70bb92db	* Per ticket #112 , localhost checks should check against 127.0.0.1/8, rather than just 127.0.0.1. This commit was SVN r10750.	2006-07-11 20:54:49 +00:00
Ralph Castain	11125dd67a	George has a retarded compiler - but that's okay. This will quiet it's warning system. This commit was SVN r10736.	2006-07-11 15:27:02 +00:00
George Bosilca	3daa063772	Make the format and the arguments match. This commit was SVN r10734.	2006-07-11 15:10:44 +00:00
Josh Hursey	9a31060b6d	Fix r10725 so that the trunk builds again. This commit was SVN r10733. The following SVN revision numbers were found above: r10725 --> open-mpi/ompi@ae222cca5b	2006-07-11 14:48:31 +00:00
Ralph Castain	ae222cca5b	Include the help file so it can be accessed This commit was SVN r10725.	2006-07-11 12:15:25 +00:00
Ralph Castain	6129a5a887	Enable -host support for "mpirun a.out". You can now execute on all slots on specified nodes within your overall allocation. This commit was SVN r10713.	2006-07-11 02:59:23 +00:00
George Bosilca	a9df5035f9	Remove unused variable. This commit was SVN r10712.	2006-07-11 00:30:51 +00:00
Ralph Castain	febc143d8c	Per LANL's stated need, add functionality that runs a.out across ALL available process slots if no num_proc is specified on the command line. However, please note the following limitation: we ONLY allow ONE application to be specified on the command line when this feature is invoked. If multiple apps are specified, the user MUST also specify the number to be launched for each and every one of them. Update the help text to report errors when not following that rule. Also updated the RMAPS help text to reflect the reorganization of some of the round-robin code into the base. The new functionality has been tested under Mac OS-X and on Odin using an MPI program. Both byslot and bynode mapping have been checked and verified. Operational support for other systems needs to be verified - I respectfully request people's help in doing so. This commit was SVN r10708.	2006-07-10 21:25:33 +00:00
Ralph Castain	3d220cbd48	This patch fixes several issues relating to comm_spawn and N1GE. In particular, it does the following: 1. Modifies the RAS framework so it correctly stores and retrieves the actual slots in use, not just those that were allocated. Although the RAS node structure had storage for the number of slots in use, it turned out that the base function for storing and retrieving that information ignored what was in the field and simply set it equal to the number of slots allocated. This has now been fixed. 2. Modified the RMAPS framework so it updates the registry with the actual number of slots used by the mapping. Note that daemons are still NOT counted in this process as daemons are NOT mapped at this time. This will be fixed in 2.0, but will not be addressed in 1.x. 3. Added a new MCA parameter "rmaps_base_no_oversubscribe" that tells the system not to oversubscribe nodes even if the underlying environment permits it. The default is to oversubscribe if needed and the underlying environment permits it. I'm sure someone may argue "why would a user do that?", but it turns out that (looking ahead to dynamic resource reservations) sometimes users won't know how many nodes or slots they've been given in advance - this just allows them to say "hey, I'd rather not run if I didn't get enough". 4. Reorganizes the RMAPS framework to more easily support multiple components. A lot of the logic in the round_robin mapper was very valuable to any component - this has been moved to the base so others can take advantage of it. 5. Added a new test program "hello_nodename" - just does "hello_world" but also prints out the name of the node it is on. 6. Made the orte_ras_node_t object a full ORTE data type so it can more easily be copied, packed, etc. This proved helpful for the RMAPS code reorganization and might be of use elsewhere too. This commit was SVN r10697.	2006-07-10 14:10:21 +00:00
Brian Barrett	41e144c879	Fix for ticket #92 , bproc stdin being borked. The problem was that we were using a pty for everything, which drops all buffered data on the floor when close() is called on the daemon side, meaning EOF has some issues. Instead, do the same thing we do for other starters that use the fork() pls -- use a pipe/fifo for stdin and stderr and a pty for stdout. This is good enough for what we need and avoids most of the issues with ptys. This commit was SVN r10692.	2006-07-08 21:18:24 +00:00
Ralph Castain	bc7690bcb0	Fix the bproc allocator. This is just a bandaid for 1.x that will be fixed more thoroughly in 2.0. Basically, the problem was that the allocator was grabbing everything on the cluster for which the user had access privilege. Thus, if a user had two sessions operable, each with its own allocation, mpirun in each session would grab both sets of nodes and use them. Not very polite. This commit was SVN r10683.	2006-07-06 18:31:14 +00:00

1010 commits