openmpi

Author	SHA1	Message	Date
Ralph Castain	1e1d0e8a89	Set the app_num attribute into the process environment so we pick it up on the other end This commit was SVN r12868.	2006-12-15 16:43:52 +00:00
Ralph Castain	677d1260aa	cleanup nicely if we don't launch This commit was SVN r12867.	2006-12-15 14:03:53 +00:00
Ralph Castain	cbb660504c	Retain the ability to run valgrind on the bproc launcher - do not call bproc_version if "nolaunch" is specified. This commit was SVN r12866.	2006-12-15 14:01:21 +00:00
Ralph Castain	64ec238b7b	Repair support for Bproc 4 on 64-bit systems. Update the SMR framework to actually support the begin_monitoring API. Implement the get/set_node_state APIs. This commit was SVN r12864.	2006-12-15 02:34:14 +00:00
Brian Barrett	38c2e43ac2	Print out error string rather than errno for TCP-related errors, making it easier for both the user and us to debug BTL and OOB issues... This commit was SVN r12852.	2006-12-14 18:20:43 +00:00
Ralph Castain	7b8f445e13	Modify the "--display-map-at-launch" option to just "--display-map". Now that we have a "--do-not-launch" option, the "-at-launch" part of the display-map option was confusing. "--display-map" displays the resulting process map before we launch anyway, so this is clearer. This commit was SVN r12840.	2006-12-13 13:49:15 +00:00
Ralph Castain	82946cb220	Add a new option to orterun: "--do-not-launch" directs the system to do the allocation, map, job setup, etc., but not actually launch the job. This lets us test all the setup portions of the code. Also, take the first step in updating how we handle mca params in ORTE - bring it closer to how it is done in the other two layers. Much more work to be done here. This commit was SVN r12838.	2006-12-13 04:51:38 +00:00
Ralph Castain	3b064a624e	For convenience, revise the orte_job_map_t object so it includes the vpid start/range values, the number of nodes, and the number of processes on each node. These values are all used in various places in the code base - we currently re-compute them multiple times. Since these values do not change and are already being computed by the RMAPS framework, we might as well just save them for re-use. This commit was SVN r12829.	2006-12-12 16:07:23 +00:00
Ralph Castain	28ce8e5e5e	Extend the mpirun options to support "--npernode N". This option tells the system to spawn N procs/node across all nodes in the allocation. If N is greater than the number of allocated slots, then the usual oversubscription logic will apply (i.e., the system will error out if oversubscription is not allowed, otherwise it will run with the sched_yield set to non-aggressive behavior). In "--npernode" operation, the "-np" command line parameter is ignored. This commit was SVN r12826.	2006-12-12 00:54:05 +00:00
Ralph Castain	8314e8dbb9	Modify the pernode option so it can accept a request for the number of processes to be launched. We now check three use-cases for pernode: 1. no -np provided - put one proc/node across all allocated nodes 2. -np N provided, N > #nodes - we print a pretty error message and exit 3. -np N provided, N <= #nodes - put one proc/node across N nodes I also added a new orte constant (ORTE_ERR_SILENT) that allows us to pass up the chain that an error was encountered, but NOT print ORTE_ERROR_LOG messages. This is intended to be used for cases where the error we encounter is NOT an orte error, but rather is one associated with incorrect user input (e.g., the preceding case 2). In such cases, there is no point in printing an ORTE_ERROR_LOG chain of messages as it isn't an orte error. This commit was SVN r12821.	2006-12-11 18:07:07 +00:00
Ralph Castain	0a5d41857a	Complete next round of message size reduction: "strip" the descriptive info from the returned values. I have now added a flag to the gpr address mode (ORTE_GPR_STRIPPED) that instructs the gpr to not include segment names or tokens in the returned gpr_value_t objects. I found only two places that were looking at the tokens: 1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message. 2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was always the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "stripped" the value structures of tokens and segment strings, and (c) correctly obtained process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes). This commit was SVN r12813.	2006-12-09 23:10:25 +00:00
Ralph Castain	58569546ed	Fix the fix to remove compiler warning - an incorrect "\" was placed in the command string. This commit was SVN r12805.	2006-12-08 04:17:38 +00:00
Sven Stork	78173a697a	Replace the test operation "-e" with "-r" to improve portability. Refs: #392 This commit was SVN r12790.	2006-12-07 12:14:40 +00:00
Ralph Castain	62d7826e01	Helps if we total up the correct field to get the total number of slots in the universe This commit was SVN r12789.	2006-12-07 03:17:12 +00:00
Ralph Castain	a1153fdc8f	Eliminate virtually all of the attribute_predefined data from the STG1 message. We now compute the total number of slots allocated to us and save that in the registry - the attribute_predefined then retrieves it via the STG1 message. The app_num is passed via the process_info structure, which gets the value from the ODLS in the environment. Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available. This commit was SVN r12788.	2006-12-07 03:11:20 +00:00
Brian Barrett	8f68764e5e	A number of heterogeneous fixes for the dss with the new buffer options: * When using the load/unload interface, stash away the current buffer type so that it can be properly unpacked on the receiving side if the buffer type is other than the receiver default * Include type information for unsized types (bool, int, size_t, pid_t) so that they can be properly unpacked by the receiver in the heterogeneous case. * Restore the NON_DESC type as the default for optimized builds, since it looks like this fixes the known issues with the non-described buffers Refs trac:587 This commit was SVN r12784. The following Trac tickets were found above: Ticket 587 --> https://svn.open-mpi.org/trac/ompi/ticket/587	2006-12-06 23:19:06 +00:00
Brian Barrett	cfeac5581a	temporarily always use described buffers as the non-described causes all kinds of problems for heterogeneous environments This commit was SVN r12783.	2006-12-06 20:22:31 +00:00
Ralph Castain	d4bd60c9fe	Restore the paffinity capability, along with all the required logic to ensure we "do the right thing" when the user gives us inaccurate information about the number of slots on a remote node. This commit was SVN r12780.	2006-12-06 15:59:34 +00:00
Ralph Castain	b1e16fffac	Add the C++ doo-hicky stuff around the odls framework definitions just in case somebody, somewhere, on some remote planet where only goats can feed needs it. This commit was SVN r12777.	2006-12-06 13:58:04 +00:00
Ralph Castain	8ca415a0c5	Remove duplicate orte_odls declaration This commit was SVN r12776.	2006-12-06 13:44:41 +00:00
Brian Barrett	6f8b366acb	Rename liborte to libopen-rte and libopal to libopen-pal per telecon today and bug #632. Refs trac:632 This commit was SVN r12762. The following Trac tickets were found above: Ticket 632 --> https://svn.open-mpi.org/trac/ompi/ticket/632	2006-12-05 18:27:24 +00:00
Tim Prins	08d5ca821f	Don't get the node architecture when using the LoadLeveler RAS. It is slow (about a second for ~300 nodes) and we don't even use the value. This commit was SVN r12758.	2006-12-05 13:47:53 +00:00
Ralph Castain	eb941d8ae2	Fix a bug that declared a node as "oversubscribed" a little early during the mapper procedure. This only affected the mapping procedure, and only if you had set the "--no-oversubscribe" flag. Kudos to Tim Prins for finding it. This commit was SVN r12757.	2006-12-05 13:04:27 +00:00
George Bosilca	6f28bcdc21	Remove the last set of compiler warnings from the precondition file. This commit was SVN r12753.	2006-12-04 21:45:57 +00:00
Brian Barrett	d64fa194f1	Instead of continually screwing around with different format strings to make this warning-proof, loop over the uint64_ts as an array of integers and use %x. The final string is just as random and formatted exactly the same, so we're all good in that department. Refs trac:655 This commit was SVN r12742. The following Trac tickets were found above: Ticket 655 --> https://svn.open-mpi.org/trac/ompi/ticket/655	2006-12-04 18:07:24 +00:00
Gleb Natapov	f0132b2499	Provide parameters in a correct order (processor/oversubscribed was swapped). This commit was SVN r12737.	2006-12-04 12:55:45 +00:00
Rainer Keller	d078bb3e8a	- Revert changes and include pointers to discussion. This commit was SVN r12736.	2006-12-03 17:05:15 +00:00
Rainer Keller	e61dd8722e	- Silence compiler on ORTE_TRANSPORT_KEY_FMT, it is fixed to llx - No functional changes, just indentation and corrections to error output. This commit was SVN r12734.	2006-12-03 13:59:23 +00:00
George Bosilca	a0ed53d70b	Make the compilers happy. This commit was SVN r12729.	2006-12-03 00:19:11 +00:00
Ralph Castain	4151a46871	Per Jeff's request (which made a lot of sense), setup the default buffer type to be DESCRIBED for debug/devel builds, and NON-DESC for optimized builds. The user can still select the default buffer type via mca parameter at runtime - this just sets the default default. :-) Also, change the dss buffer type mca param to something more easily remembered (it is now "dss_buffer_type"). Heck, even I had to keep looking at the darn code to remember it. This commit was SVN r12728.	2006-12-02 13:32:16 +00:00
George Bosilca	3fd278c522	Make the tree compile in debug mode. This commit was SVN r12724.	2006-12-01 23:03:09 +00:00
Ralph Castain	897744cdeb	Two major changes to the runtime: 1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1". 2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomial tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial". 3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1". 4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer) This commit was SVN r12722.	2006-12-01 22:30:39 +00:00
Jeff Squyres	3cf7dddd47	Fixes trac:635. Ralph identified the problem, I tracked down ''where'' the fd was being closed, and Brian figured out ''why'' (and the fix). What was happening is that a remote process was closing its stdout/stderr and therefore sending a 0-byte IOF message to mpirun. mpirun, in turn, closed the iof endpoint associated with that stream (i.e., stdout/stderr). IOF does this to handle the case where mpirun's stdin is closed -- this therefore causes the stdin on all the ORTE-started processes to have their stdin's closed as well. So the workaround here is to check that if we get a 0-byte IOF message on a sink (indicating a remote closure), and if that sink is the special stdout or stderr stream, don't actually close anything in the local process. This commit was SVN r12691. The following Trac tickets were found above: Ticket 635 --> https://svn.open-mpi.org/trac/ompi/ticket/635	2006-11-28 21:42:49 +00:00
Ralph Castain	0398c9e0c5	Correctly setup the sched_yield when launching processes via the orteds. This still doesn't adjust the yield schedule "on-the-fly" as more procs are dynamically added to a node - it just sets it when they are first launched. This commit was SVN r12683.	2006-11-28 08:27:20 +00:00
George Bosilca	8df8d86b85	Complete the functions to match the expected prototype. This commit was SVN r12680.	2006-11-28 00:44:30 +00:00
Ralph Castain	bc4e97a435	First stage in the move to a faster startup. Change the ORTE stage gate xcast into a binary tree broadcast (away from a linear broadcast). Also, removed the timing report in the gpr_proxy component that printed out the number of bytes in the compound command message as the answer was "not much" - reduces the clutter in the data. This commit was SVN r12679.	2006-11-28 00:06:25 +00:00
Ralph Castain	652b91ee26	Remove some compiler warnings This commit was SVN r12678.	2006-11-27 23:47:36 +00:00
Ralph Castain	9bc25f0bec	Fix a potential bug in the registry where it didn't fully check a segment's name when searching for it. Will have to verify that this doesn't break other things. Bring the bproc system close to being back online.... This commit was SVN r12659.	2006-11-23 04:17:37 +00:00
Brian Barrett	32833deff0	since orteboot, ortehalt, and ortekill were all added today (including to configure.ac), we need to add them to SUBDIRS to make them end up in the tarball as well... This commit was SVN r12658.	2006-11-23 03:10:57 +00:00
Ralph Castain	deb2470ba3	Move the waitpid callback in the bproc pls after we store the daemon info. Otherwise, a short-lived app could terminate before we store the daemon info, causing mpirun to not terminate the daemons since the call to get_active_daemons would return a NULL list. This commit was SVN r12656.	2006-11-22 22:49:22 +00:00
Rainer Keller	b63500f62c	- Don't unlock ompi_rte_mutex unconditionally, use the macro instead. This commit was SVN r12655.	2006-11-22 21:01:43 +00:00
Ralph Castain	b1ff5fe868	Move the name of the bproc common segment to the central schema location - avoids conflicts when bproc 3 components try to build This commit was SVN r12654.	2006-11-22 20:23:17 +00:00
Ralph Castain	8080034eb2	Clean up a compile issue for bproc This commit was SVN r12653.	2006-11-22 19:50:27 +00:00
Ralph Castain	428c1f14c3	Modify the bproc components to resolve the current allocation problem This commit was SVN r12652.	2006-11-22 19:10:58 +00:00
Ralph Castain	7f95b27141	Correctly "hide" the new orte tools - they shouldn't get compiled or seen unless you specifically go into those subdirectories and manually do a "make". This commit was SVN r12650.	2006-11-22 14:35:16 +00:00
Sven Stork	dc116d4814	- Add missing mutex lock This commit was SVN r12649.	2006-11-22 13:37:58 +00:00
Ralph Castain	6fca1431f3	Back out some prior commits. These commits fixed bproc so it would run, but broke several other things (singleton comm_spawn and hostfile operations have been identified so far). Since bproc is the culprit here, let's leave bproc broken for now - I'll work on a fix for that environment that doesn't impact everything else. This commit was SVN r12648.	2006-11-22 13:30:21 +00:00
Brian Barrett	0895f5e08d	Rename OMPI_PROCESS_NAME_{HTON, NTOH} macros to ORTE_PROCESS_NAME_{HTON, NTOH} because they are in ORTE, not OMPI. Also, remove the ORTE_PROCESS_NAME macros in iof base as they are duplicates of the ones that were in ns_types, which meant that bad things happened if you changed what an orte_process_name_t looked like. This commit was SVN r12646.	2006-11-22 03:03:21 +00:00
Brian Barrett	33320b7165	Rework the opal_progress interface to better support dynamic processes and at the same time, remove some of the MPI-related options from OPAL: - provide mechanism to change at runtime whether sched_yield() should be called when the progress engine is idle - provide mechanism for changing the rate at which the event engine is called when there are "no" users of the event engine (ie, when using MPI but not TCP) - fix some function names in the progress engine to better match their intended use (and remove MPI naming scheme) - remove progress_mpi_enable / progress_mpi_disable because we can now use the functions to set the sched_yield and tick rate interfaces - rename opal_progress_events() to opal_progress_set_event_flag() because the first really isn't descriptive of what the function does and I always got confused by it This commit was SVN r12645.	2006-11-22 02:06:52 +00:00
Ralph Castain	9f3dcd147a	Round and round the mulberry bush we go... Fix comm_spawn by singletons. orte_init does some voodoo to let the system know about localhost when we are a singleton. This includes allocating it so that any comm_spawn'd children can use their parent "allocation". Unfortunately, the fix that bproc needs (due to that smr filling up the node segment!) causes the singleton startup to fail. The fix is to just have the singleton startup force an allocation of its localhost. Only issue here is: what happens if we are in a persistent universe? The singleton will now overwrite any prior info on slots used on localhost by other jobs (won't affect anything else). The answer, of course, is to do something more intelligent - lookup localhost on the registry and just update its info instead of overwriting it. Something for another day (or month....or year) This commit was SVN r12644.	2006-11-21 21:51:58 +00:00
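
As an illustration of the pernode handling described in commit 8314e8dbb9 (SVN r12821), here is a minimal Python sketch of the three use-cases listed there. The function and exception names are hypothetical and are not taken from the Open MPI source (which is written in C); the exception only stands in for the idea behind ORTE_ERR_SILENT, namely reporting a user-input error without an error-log chain.

```python
class SilentUserError(Exception):
    """Hypothetical stand-in for ORTE_ERR_SILENT: signals bad user input,
    not an internal runtime error, so no error-log chain is printed."""


def pernode_node_count(num_nodes, np=None):
    """Return how many nodes get one process each under the pernode option.

    num_nodes: number of allocated nodes (assumed to be known already).
    np: value of -np, or None if the user did not provide it.
    """
    if np is None:
        # Case 1: no -np provided, one proc/node across all allocated nodes.
        return num_nodes
    if np > num_nodes:
        # Case 2: more processes requested than nodes; report and stop.
        raise SilentUserError(
            f"requested {np} processes but only {num_nodes} nodes are allocated"
        )
    # Case 3: -np N with N <= #nodes, one proc/node across N nodes.
    return np
```

For example, `pernode_node_count(4)` covers all four nodes, `pernode_node_count(4, np=3)` uses three of them, and `pernode_node_count(4, np=5)` raises the user-input error.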

1049 commits