I know it's just a technicality, but it is time to address such things rather than just letting them continue to propagate. :-)
This commit was SVN r12954.
This has now been corrected. The singleton startup will dutifully call the mapper framework so that the proper data storage locations get initialized. Unfortunately, we then had to instruct the RMAPS not to allocate a vpid range for this job - otherwise, it would make a mistake and think there were two processes in it. Hence, a change was required to RMAPS to tell it "map this job, but don't allocate a vpid range for it".
This change will need to migrate across to 1.2 after it "soaks" the appropriate time.
This commit was SVN r12952.
is allocated on a per comm_world instance, with the lowest rank
in comm_world on the given host creating and initializing the file,
and then notifying the remaining processes via the OOB.
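In rough pseudo-C, the pattern described is (all helper names here are
hypothetical, not the actual mpool/OOB calls):
{{{
/* per-host setup: lowest comm_world rank on the host creates the
 * file; the OOB message is just a "file is ready" signal */
if (my_rank == lowest_rank_on_this_host) {
    create_and_init_backing_file(path);   /* hypothetical helper */
    oob_notify_local_peers(path);         /* hypothetical helper */
} else {
    oob_wait_for_file_ready(path);        /* hypothetical helper */
}
attach_to_backing_file(path);             /* everyone maps the file */
}}}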
Reviewed: Ralph Castain, Brian Barrett
Addressing ticket #674.
This commit was SVN r12949.
rc (which is -1 or 4 if we hit this case) resulted in an odd error
claiming that a signal killed the proc (instead of a startup error, as
is the reality).
Instead, use the W_EXITCODE macro (if available) to build up an exit
code that carries an error code as the exit status, but does not make
it look like the process died from a signal.
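For reference, a minimal sketch of what that looks like. W_EXITCODE is
a glibc extension in <sys/wait.h> (hence the "if available" guard); the
fallback shift is my assumption about the common status layout, not
something this commit promises:
{{{
#include <sys/wait.h>

/* build a wait()-style status word meaning "exited with code rc"
 * rather than "killed by signal rc" */
static int build_exit_status(int rc)
{
#ifdef W_EXITCODE
    return W_EXITCODE(rc, 0);      /* exit status = rc, no signal */
#else
    return (rc & 0xff) << 8;       /* typical layout: code in bits 8-15 */
#endif
}
}}}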
This commit was SVN r12890.
1. no -np provided - put one proc/node across all allocated nodes
2. -np N provided, N > #nodes - we print a pretty error message and exit
3. -np N provided, N <= #nodes - put one proc/node across N nodes
I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error.
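Roughly, the resulting control flow looks like this (a sketch only -
the show_help file and topic names are made up, not the actual rmaps
code):
{{{
if (!np_was_given) {                          /* case 1 */
    num_procs = num_allocated_nodes;          /* one proc per node */
} else if (num_procs > num_allocated_nodes) { /* case 2 */
    /* pretty error message; file/topic names here are hypothetical */
    opal_show_help("help-rmaps.txt", "too-many-procs", true);
    return ORTE_ERR_SILENT;                   /* no ORTE_ERROR_LOG chain */
} else {                                      /* case 3 */
    /* one proc per node across the first num_procs nodes */
}
}}}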
This commit was SVN r12821.
I found only two places that were looking at the tokens:
1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message.
2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was *always* the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "strip" the value structures of tokens and segment strings, and (c) correctly obtain process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes).
This commit was SVN r12813.
Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available.
This commit was SVN r12788.
1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1".
2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomial tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial". (A sketch of the binomial fan-out follows this list.)
3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1".
4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer)
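For item 2, the binomial mode relays the message along the classic
binomial tree - O(log N) rounds instead of N sequential sends. A
generic, self-contained sketch of that pattern (not the actual xcast
code; the sends/recvs are left as comments):
{{{
/* classic binomial broadcast from rank 0 among "size" daemons */
void binomial_relay(int rank, int size)
{
    int mask = 1;
    /* receive phase: my parent is my rank minus my lowest set bit */
    while (mask < size) {
        if (rank & mask) {
            /* recv_from(rank - mask); */
            break;
        }
        mask <<= 1;
    }
    /* send phase: relay to peers at decreasing power-of-two offsets */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size) {
            /* send_to(rank + mask); */
        }
        mask >>= 1;
    }
}
}}}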
This commit was SVN r12722.
Ralph identified the problem, I tracked down ''where'' the fd was
being closed, and Brian figured out ''why'' (and the fix).
What was happening is that a remote process was closing its
stdout/stderr and therefore sending a 0-byte IOF message to mpirun.
mpirun, in turn, closed the iof endpoint associated with that stream
(i.e., stdout/stderr). IOF does this to handle the case where
mpirun's stdin is closed -- this causes all the ORTE-started
processes to have their stdins closed as well.
So the workaround here is: if we get a 0-byte IOF message on a sink
(indicating a remote closure), and that sink is the special stdout or
stderr stream, don't actually close anything in the local process.
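In rough code form, the guard looks something like this (field and
constant names are approximations, not the exact ones in the tree):
{{{
/* a 0-byte read means the remote side closed its stream; for the
 * shared stdout/stderr sinks, just ignore it */
if (0 == num_bytes &&
    (ORTE_IOF_STDOUT == sink->tag || ORTE_IOF_STDERR == sink->tag)) {
    return;   /* leave the local endpoint open for everyone else */
}
}}}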
This commit was SVN r12691.
The following Trac tickets were found above:
Ticket 635 --> https://svn.open-mpi.org/trac/ompi/ticket/635
because they are in ORTE, not OMPI. Also, remove the ORTE_PROCESS_NAME macros
in iof base as they are duplicates of the ones that were in ns_types, which
meant that bad things happened if you changed what an orte_process_name_t
looked like.
This commit was SVN r12646.
We were burned again by the fact that the bproc state monitor creates entries on the node segment for *all* the nodes in the cluster when it is opened during orte_init. As a result, the bjs allocator was never being called, and the system merrily assumed that *all* nodes in the cluster had been allocated to it.
To fix this, I removed a test that had been inserted into the allocation procedure that checked for a non-zero node segment. This was an old artifact - the RAS components already know that they are not to overwrite any existing node segment entries (at least, bproc does - I will check the others. For now, I just want to save the bproc fix on this machine).
This commit was SVN r12640.
Modify the RMAPS framework so we eliminate communicating a map to a backend node when certain attributes are set. The proxy functions are now implemented in the base, and a check for HNP/non-HNP operation is made in the map_jobs function prior to execution.
This commit was SVN r12619.
Add placeholders for the new orte tools. These don't actually do anything yet - in fact, I have set the .ompi_ignore so that you won't compile them (I have set a .ompi_unignore for me). Please let me know if you encounter any trouble with this - the ompi_ignore's should protect everyone.
This commit was SVN r12616.
Note that Bproc won't support this operation, so we just ignore the --reuse-daemons directive.
I'm afraid I don't understand the POE and XGrid environments well enough to attempt the necessary modifications.
Also, please note that XGrid support has been broken on the trunk. I don't understand the code syntax well enough to make the required changes to that PLS component, so it won't compile at the moment. I'm hoping Brian has a few minutes to fix it after SC.
This commit was SVN r12614.
1. new functionality in the pls base to check for reusable daemons and launch upon them
2. an extension of the odls API to allow each odls component to build a notify message with the "correct" data in it for adding processes to the local daemon. This means that the odls now opens components on the HNP as well as on daemons - but that's the price of allowing so much flexibility. Only the default odls has this functionality enabled - the others just return NOT_IMPLEMENTED
3. addition of a new command line option "--reuse-daemons" to orterun. The default, for now, is to NOT reuse daemons. Once we have more time to test this capability, we may choose to reverse the default. For one thing, we probably want to investigate the tradeoffs in start time for comm_spawn'd processes that reuse daemons versus launch their own. On some systems, though, having another daemon show up can cause problems - so they may want to set the default as "reuse".
This is ONLY enabled for rsh launch, at the moment. The code needing to be added to each launcher is about three lines long, so I'll be doing that as I get access to machines I can test it on.
This commit was SVN r12608.
1. use non-blocking sends to transmit commands (this was actually done in a prior commit)
2. have an "ack" message sent back from the orted when it completes the command
The latter item is the new one here. With my prior commit, it was possible for the HNP to move on to other things before the orted had completed its command. This caused the HNP to occasionally exit before the orted, thus generating "lost connection" errors. With this change, we retain the parallel nature of the command communications, but still hold the HNP at that point until the orteds are done.
Best of both worlds.
This commit was SVN r12605.
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
Add some debugger output to the ODLS default component.
Modify the orted command communication system so that it is done via non-blocking sends. This removes the linearity of the transmission and improves the response time.
This commit was SVN r12585.
Make it so if -np was not passed and -pernode was, we map bynode
This commit was SVN r12580.
The following Trac tickets were found above:
Ticket 612 --> https://svn.open-mpi.org/trac/ompi/ticket/612
Add some debugging output to the ODLS default module, and the orted.
Remove the nodename data from the ODLS info report - that info is already stored in the registry by the RMAPS framework upon completing the mapping procedure.
Add another test program that does an ORTE-only dynamic spawn (gasp!). Looks just like comm_spawn - just no MPI involved.
Modify the ODLS to release the processor when we "kill" local procs in a more scalable fashion. It previously had a sleep in it that Jeff's prior commit removed. However, he introduced some Windows code into the non-Windows component (protected by "if"s, but unnecessary). This is a more general solution he proposed - included here so I could get things to compile properly.
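Presumably the idea is along these lines (my sketch of the
yield-instead-of-sleep approach; the helper is hypothetical and this is
not the committed code):
{{{
#include <sched.h>

/* give up the processor between kill/waitpid passes instead of
 * sleeping for a fixed interval */
while (local_procs_still_alive()) {   /* hypothetical helper */
    sched_yield();
}
}}}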
This commit was SVN r12579.
1. Fix the "hang" condition when an application isn't found. It turned out that the ODLS had difficulty with the case where the process was never actually started - hence, it never called the waitpid callback. As a result, the "terminated" trigger didn't fire, and so mpirun didn't wake up. With this change, the HNP's errmgr forces the issue by causing the trigger to fire itself when an abort condition occurs.
2. Shift the recording of the pid and the nodename from mpi_init to the orted launcher. This allows programs such as Eclipse PTP to get the pids even for non-MPI applications. In the case of bproc, the pls handles this chore since we don't use orteds in that system.
This commit was SVN r12558.
check for Bourne shell, because Bourne shell is the lowest common
denominator for bash/ksh/sh.
- Make some shell expressions sh-compatible
This commit was SVN r12509.
1. Added reporting points around the xcasts in MPI_Init. Note that these times will include time spent waiting for a trigger to fire, which is why the times between stage gates did NOT include these times initially. The inter-stage-gate times still do NOT include the xcast time - the xcast time is reported separately.
2. Added the process vpid on the MPI_Init timing reports for clarity.
3. Added a report from the xcast function on the HNP that outputs the number of bytes in the message being sent to the processes.
This commit was SVN r12422.
1. ORTE_RMAPS_DISPLAY_AT_LAUNCH: pretty-prints out the process map right before we launch so you can see where everyone is going. This is settable via the command line option "--display-map-at-launch"
2. ORTE_RMGR_STOP_AFTER_SETUP: just setup the job and then return from the spawn command.
3. ORTE_RMGR_STOP_AFTER_ALLOC: return from the rmgr.spawn call after allocating the job
4. ORTE_RMGR_STOP_AFTER_MAP: return from the rmgr.spawn call after mapping the job. This gives folks a chance to retrieve and graphically display the map, let the user edit it, and store the results. They can then call "launch" on their own and the system will use the revised map.
Enjoy! My personal favorite is the first one - helps with debugging.
This commit was SVN r12379.
Setup subscriptions to correctly return the MPI_APPNUM attribute.
Fix an unreported bug that was found. The universe size was incorrectly defined in the attributes code. As coded, it looked for size_t values and based its size computation on those numbers. Unfortunately, the node_slots value had been changed to an orte_std_cntr_t a while back! So the universe size was never updated.
Update the hello_nodename test to check for MPI_APPNUM (see the example below).
Add a definition to ns_types for ORTE_PROC_MY_NAME - just a shortcut for orte_process_info.my_name. Brought over from ORTE 2.0 as it will be used extensively there.
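A self-contained way to exercise the attribute, in the same spirit as
the updated hello_nodename test:
{{{
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int *appnum, flag;

    MPI_Init(&argc, &argv);
    /* MPI_APPNUM identifies which app_context this process came from */
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_APPNUM, &appnum, &flag);
    if (flag) {
        printf("MPI_APPNUM = %d\n", *appnum);
    }
    MPI_Finalize();
    return 0;
}
}}}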
This commit was SVN r12377.
Give a more intelligible error message when someone passes -nolocal and the only available node is the local node.
This commit was SVN r12325.
The following Trac tickets were found above:
Ticket 487 --> https://svn.open-mpi.org/trac/ompi/ticket/487
If you want to look at our launch and MPI process startup times, you can do so with two MCA params:
OMPI_MCA_orte_timing: set it to anything non-zero and you will get the launch time for different steps in the job launch procedure. The degree of detail depends on the launch environment. rsh will provide you with the average, min, and max launch time for the daemons. SLURM block launches the daemon, so you only get the time to launch the daemons and the total time to launch the job. Ditto for bproc. TM looks more like rsh. Only those four environments are currently supported - anyone interested in extending this capability to other environs is welcome to do so. In all cases, you also get the time to setup the job for launch.
OMPI_MCA_ompi_timing: set it to anything non-zero and you will get the time for mpi_init to reach the compound registry command, the time to execute that command, the time to go from our stage1 barrier to the stage2 barrier, and the time to go from the stage2 barrier to the end of mpi_init. This will be output for each process, so you'll have to compile any statistics on your own. Note: if someone develops a nice parser to do so, it would be really appreciated if you could/would share!
This commit was SVN r12302.
packing a sockaddr_in, as there are some endianness and padding issues
with sending a sockaddr_in. Note that the sin_port and sin_addr are
already in network byte order, which is why we pack them as a byte
string.
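A sketch of the byte-string packing (assuming the dss pack call of this
vintage; the exact signature may differ):
{{{
#include <netinet/in.h>

struct sockaddr_in addr;   /* already filled in; sin_port and sin_addr
                            * are already in network byte order */

/* pack as opaque bytes - sidesteps endianness and struct padding */
orte_dss.pack(buffer, &addr.sin_port, sizeof(addr.sin_port), ORTE_BYTE);
orte_dss.pack(buffer, &addr.sin_addr, sizeof(addr.sin_addr), ORTE_BYTE);
}}}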
Refs trac:493
This commit was SVN r12301.
The following Trac tickets were found above:
Ticket 493 --> https://svn.open-mpi.org/trac/ompi/ticket/493
Only close off stdout/stderr from the daemons if we are not debugging the slurm pls and --debug-daemons was not passed.
This commit was SVN r12276.
The following Trac tickets were found above:
Ticket 352 --> https://svn.open-mpi.org/trac/ompi/ticket/352
Get the ordering right so that a singleton can start.
Protect the rmgr copy app_context function from NULL fields
Tell the mapper it is okay for there not to be a pre-existing mapping plan for a parent when dynamically spawning processes
This commit was SVN r12257.
Fix the persistent daemon problem where it was exiting when a job completed. Problem was that the persistent daemon would order the job daemons to exit. They would then send an 'ack' back to the persistent daemon - but the ack consisted of an echo of the "exit" command, which was recv'd by the wrong listener, who treated it as a properly sent cmd... and exited.
This commit was SVN r12243.
- It is possible to leave a byslot/bynode routine and have cur_node_item be NULL, so check for that.
- After we do an allocation where the user has provided a map (i.e. with --host), cur_node_item is pointing into the map list, not the global list. Change it to point into the global list.
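i.e., guards roughly like these (illustrative, not the committed diff):
{{{
/* byslot/bynode mapping can walk off the end of the list */
if (NULL == cur_node_item) {
    cur_node_item = opal_list_get_first(&node_list);
}
/* after a user-provided --host map, re-point the bookmark from the
 * map's own list back into the global node list before continuing */
}}}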
This commit was SVN r12232.
Also, I am no longer seeing any issue with the child job spawning its own daemons - this appears to be fixed. We still don't reuse the existing daemons, however, but that will come.
This commit was SVN r12229.
This patch will cause a problem for cnos, however, as there we want to specifically tell the backends to be "null". I'm working on that issue.
This commit was SVN r12225.
- Simplified the logic of the ras modules by moving the attribute handling into the base allocation function. This allows us to decide how to allocate based on the situation, and solves some of the allocation problems we were having with comm_spawn.
- moved the proxy component into the base. This was done because we always want to call the proxy functions if we are not on an HNP, regardless of the attributes passed.
- Got rid of the hostfile component. What little logic was in it was moved into the base to deal with other circumstances. The hostfile information is currently being propagated into the registry by the RDS, so we just use what is already in the registry.
- renamed some slurm function so that they have the proper prefix. Not strictly necessary as they were static, but it makes debugging much easier.
- fixed a buglet in the round_robin rmaps where we would return an error when really no error occurred.
I tried to make proper corrections to all the ras modules, but I cannot test all of them.
This commit was SVN r12202.
Update the mapper so it correctly points to the next node to be used if we are mapping by slots. As it was, if we had an app_context that used only one slot on a node, the next app_context would start on the next node - leaving a blank slot in-between.
This commit was SVN r12193.
Modify the mapper to better bookmark its stopping place each time, and to pick up the next time from there. This needs to be validated on a multi-node system.
Fix a major memory corruption problem in the registry put/get functions that was doing multiple free's. Not sure how valgrind missed this one, though it only occurred in specific circumstances (such as comm_spawn).
This commit was SVN r12179.
This change does a couple of things:
1. Since the USE_PARENT_ALLOC attribute is a directive regarding the allocation of resources to a job, it more properly should be an attribute of the RAS. Change the name to reflect that and move the attribute define to the ras_types.h file.
2. Add the attributes list to the RMAPS map_job interface. This provides us with the desired flexibility to dynamically specify directives for mapping. The system will - in the absence of any attribute-based directive - default to the values provided in the MCA parameters (either from environment or command-line interface).
This commit was SVN r12164.
In this implementation, we begin mapping on the first node that has at least one slot available as measured by the slots_inuse versus the soft limit. If none of the nodes meet that criterion, we just start at the beginning of the node list since we are oversubscribed anyway.
Note that we ignore this logic if the user specifies a mapping - then it's just "user beware".
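The starting-point rule sketched in code (the list and field names are
illustrative):
{{{
/* start at the first node with a free slot under the soft limit; if
 * every node is full we are oversubscribing anyway, so wrap to the
 * head of the list */
opal_list_item_t *start = NULL, *item;

for (item  = opal_list_get_first(&node_list);
     item != opal_list_get_end(&node_list);
     item  = opal_list_get_next(item)) {
    orte_ras_node_t *node = (orte_ras_node_t*)item;
    if (node->node_slots_inuse < node->node_slots) {
        start = item;
        break;
    }
}
if (NULL == start) {
    start = opal_list_get_first(&node_list);
}
}}}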
The real root cause of the problem is that we don't adjust sched_yield as we add processes onto a node. Hence, the node becomes oversubscribed and performance goes into the toilet. What we REALLY need to do to solve the problem is:
(a) modify the PLS components so they reuse the existing daemons,
(b) create a way to tell a running process to adjust its sched_yield, and
(c) modify the ODLS components to update the sched_yield on a process per the new method
Until we do that, we will continue to have this problem - all this fix (and any subsequent one that focuses solely on the mapper) does is hopefully make it happen less often.
This commit was SVN r12145.
Fix the problem observed by multiple people that comm_spawned children were (once again) being mapped onto the same nodes as their parents. This was caused by going through the RAS a second time, thus overwriting the mapper's bookkeeping that told RMAPS where it had left off.
To solve this - and to continue moving forward on the ORTE development - we introduce the concept of attributes to control the behavior of the RM frameworks. I defined the attributes and a list of attributes as new ORTE data types to make it easier for people to pass them around (since they are now fundamental to the system, and therefore we will be packing and unpacking them frequently). Thus, all the functions to manipulate attributes can be implemented and debugged in one place.
I used those capabilities in two places:
1. Added an attribute list to the rmgr.spawn interface.
2. Added an attribute list to the ras.allocate interface. At the moment, the only attribute I modified the various RAS components to recognize is the USE_PARENT_ALLOCATION one (as defined in rmgr_types.h).
So the RAS components now know how to reuse an allocation. I have debugged this under rsh, but it now needs to be tested on a wider set of platforms.
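Usage ends up looking roughly like this (the attribute helper and spawn
signature are paraphrased, not the real API - only the
USE_PARENT_ALLOCATION attribute itself is named in rmgr_types.h):
{{{
/* illustrative: ask the RAS to reuse the parent job's allocation
 * when spawning a child job */
opal_list_t attrs;

OBJ_CONSTRUCT(&attrs, opal_list_t);
orte_rmgr.add_attribute(&attrs, ORTE_RMGR_USE_PARENT_ALLOCATION); /* approx */
rc = orte_rmgr.spawn(apps, num_apps, &child_jobid, &attrs);       /* approx */
OBJ_DESTRUCT(&attrs);
}}}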
This commit was SVN r12138.
seed value have something set to true. Allow selection of the threaded
listen type if (and only if) the process is the HNP...
This commit was SVN r12105.
processes launched locally for the stdio file names. This was causing
the expected files to not exist and bproc_vexecmove_io to fail.
* Clean up a bunch of debugging output in the bproc pls
This commit was SVN r12102.
I have tested on rsh, slurm, bproc, and tm. Bproc continues to have a problem (will be asking for help there).
Gridengine compiles but I cannot test (believe it likely will run).
Poe and xgrid compile to the extent they can without the proper include files.
This commit was SVN r12059.