openmpi

Author	SHA1	Message	Date
Jeff Squyres	6d0d8848ac	Fix CID 1129: Remove variable that is set but never used. This commit was SVN r20194.	2009-01-03 15:39:51 +00:00
Ralph Castain	7787f84540	Per the earlier RFC and some discussion at the Dec ORTE design meeting, add the ompi-top tool and all its supporting infrastructure. This includes a new OPAL pstat framework and data type, currently with rather weak support for Mac OSX and pretty complete support for Linux. The Sun team promised to add Solaris support as well. Also, per chat with Jeff, modified the Makefile.am's of a few orte tools so that they were consistent in the way we generate the ompi-equivalent cmds. This commit was SVN r20165.	2008-12-22 20:23:05 +00:00
Ralph Castain	728a24c8ec	After considerable patience and help with debugging/testing from Tim M and Jeff S, return a completed and pretty well tested patch of the IOF to the trunk. This commit includes the previously reverted r20074, r20068, and r20064, as well as changes to fix those commits. Basically, the remaining problem turned out to be: 1. closing stdout/stderr during orte_finalize of mpirun 2. inadvertently setting up a write event on fd = -1 3. devising a scheme to more accurately track when the stdin write event was active vs closed so it only got released once This passed prelim MTT testing by Jeff and Tim, but should soak for awhile before migrating to 1.3. This commit was SVN r20106. The following SVN revision numbers were found above: r20064 --> open-mpi/ompi@a07660aea8 r20068 --> open-mpi/ompi@ec930d14a9 r20074 --> open-mpi/ompi@2940309613	2008-12-10 20:40:47 +00:00
Ralph Castain	9d7cb82bba	Modify the daemon cmd processor to relay and then process the cmd locally. We couldn't do this before due to the daemon's needing to update contact info prior to doing the relay. However, the new routed system plus the inclusion of the nidmap in the launch message now makes this possible. It is a small launch performance improvement as now we relay the launch cmd across to the next daemon before taking the time to launch our own local procs. Still, it does allow more parallel operations during the launch procedure. This commit was SVN r20104.	2008-12-10 19:18:36 +00:00
Ralph Castain	e28210d0dc	Revert r20074, r20068, and r20064: remove the IOF proc completion code pending further off-trunk work. This commit was SVN r20089. The following SVN revision numbers were found above: r20064 --> open-mpi/ompi@a07660aea8 r20068 --> open-mpi/ompi@ec930d14a9 r20074 --> open-mpi/ompi@2940309613	2008-12-09 17:11:59 +00:00
Ralph Castain	2940309613	Attempt to solve a race condition showing up in some MTT runs. There were three entry points for proc termination info into the ODLS: 1. a direct callback from waitpid - this set the waitpid_fired flag 2. a notify event callback from the IOF - this set the iof complete flag 3. a message via the daemon cmd processor from the proc "de-registering" the sync, thus indicating it was going through MPI_Finalize. The problem is that these could overlap, with the first two allowing the orted to declare the proc complete before the daemon had responded to #3. This change forces all three events to flow through the daemon cmd processor, thus ensuring an ordered handling. I'm not certain this will solve the problem, but will await further MTT reports to see. Unfortunately, the problem doesn't show up on any manual or script-based tests I have been able to run, even when I duplicate the exact cmd that fails under MTT. This commit was SVN r20074.	2008-12-05 04:20:00 +00:00
Ralph Castain	a07660aea8	Bring over the IOF completion changes. This commit fixes the long-occurring problem whereby application procs could, under some circumstances, lose their final prints to stdout/err. The commit includes: 1. coordination of job completion notification to include a requirement for both waitpid detection AND notification that all iof pipes have been closed by the app 2. change of all IOF read and write events to be non-persistent so they can properly be shutdown and restarted only when required 3. addition of a delay (currently set to 10ms) before restarting the stdin read event. This was required to ensure that the stdout, stderr, and stddiag read events had an opportunity to be serviced in scenarios where large files are attached to stdin. This commit was SVN r20064.	2008-12-03 17:45:42 +00:00
Ralph Castain	9a57db4a81	To support comm_spawn in fully routed environments, daemons need to know the route to all procs in their job family. They already had this information, but were not retaining it. The infrastructure to do so has existed for some time - just never had the time to complete it. This commit does that by ensuring that daemons retain knowledge of proc location for all procs in their job family. It required a minor change to the ESS API to allow the daemons to update their pidmaps as data was received. In addition, the routed modules have been updated to take advantage of the newly available info, and the encode/decode pidmap utilities have been updated to communicate the required info in the launch message. This commit was SVN r20022.	2008-11-18 15:35:50 +00:00
Ralph Castain	68423f7544	Partially restore the iof changes - this repairs the initial observation of inconsistent and incomplete output This commit was SVN r19999.	2008-11-14 20:36:18 +00:00
Ralph Castain	586334d1c8	Per discussion with Tim Mattox, reset the trunk to pre-19991 level for the iof only. I will shortly add a changeset that will repair the one known error where we were incorrectly closing the stdout/err/diag file descriptors when all we wanted to do was close stdin. I will leave out the changes associated with coordinating proc termination due to race conditions IU encountered during MTT testing. I have been unable to replicate those so far, but we hope to resolve it in the near future. This commit was SVN r19998.	2008-11-14 20:22:36 +00:00
Ralph Castain	555bbf0c02	Fix the iof race conditions wrt proc termination. This is comprised of two sections: 1. modify the iof to track when a proc actually closes all of its open iof output pipes. When this occurs, notify the odls that the proc's iof is complete. This is done via a zero-time event so that we can step out of the read event before processing the notification. 2. in the odls, modify the waitpid callback so it only flags that it was called. Add a function to receive the iof-complete notification, and a function that checks for both iof complete and waitpid callback before declaring a proc fully terminated. This ensures that we read and deliver -all- of the IO prior to declaring the job complete. Also modified the odls call to orte_iof.close (and the component's implementation) so it only closes stdin, leaving the other io channels alone. This fixes the other half of the known problem. This should fix the ticket on this subject, but I'll wait to close it pending further testing in the trunk. This commit was SVN r19991.	2008-11-12 23:32:01 +00:00
Ralph Castain	5889dcd30b	Fix a warning reported by Jeff that actually could cause singleton operations to fail. Ensure that the byte object used to init the job map for singleton's is properly initialized. This commit was SVN r19957.	2008-11-08 01:09:06 +00:00
George Bosilca	d23fe1bb10	Include Ralph's suggestions, i.e. keep the hnp and orted management in sync. This commit was SVN r19872.	2008-11-01 00:39:46 +00:00
George Bosilca	9528d33e90	Nothing relevant, few indentations and replace tab by spaces. This commit was SVN r19870.	2008-10-31 22:24:52 +00:00
George Bosilca	ebe87d1842	Apply some suggestions from Ralph and avoid a pretty nasty race condition on the close of the fd. The problem was that we close the same fd twice, and that meantime the fd could have been reassigned to some other file or socket. This commit was SVN r19869.	2008-10-31 22:23:53 +00:00
Ralph Castain	f54fda489e	This is a first step towards supporting fully-routed OOB communications: 1. remove direct routed module (hooray!) 2. add radix tree routed module (binomial remains default) 3. remove duplicate data storage - orteds were storing nidmap and pidmap data in odls, everyone else in ess 4. add ess APIs to update nidmap, add new pidmap - used only by orteds for MPI-2 support 5. modify code to eliminate multiple calls to orte_routed.update_route that recreated info already in ess pidmap. Add ess API to lookup that info instead. Modify routed modules to utilize that capability 6. setup new ability to shutdown orteds without sending back an "ack" message to mpirun - not utilized yet, will require some changes to plm terminate_orteds functions in managed environments (coming soon) Initial tests indicating that fully routing comm via defined routing trees may not actually have a significant cost for operations like IB QP setup. More tests required to confirm. This will require an autogen... This commit was SVN r19866.	2008-10-31 21:10:00 +00:00
George Bosilca	0ce76248e8	Close the file descriptors used to push or pull the data to the children. Without this patch, doing spawn in a loop ended up by exhausting all available file descriptors pretty quickly. There were about 5 file descriptors opened per spawned process. Now the number of file descriptors managed by the process (orted or HNP) is a lot smaller. This commit was SVN r19864.	2008-10-31 18:05:28 +00:00
Ralph Castain	6e5d844c36	Roll in the revamped IOF subsystem. Per the devel mailing list email, this is a complete rewrite of the iof framework designed to simplify the code for maintainability, and to support features we had planned to do, but were too difficult to implement in the old code. Specifically, the new code: 1. completely and cleanly separates responsibilities between the HNP, orted, and tool components. 2. removes all wireup messaging during launch and shutdown. 3. maintains flow control for stdin to avoid large-scale consumption of memory by orteds when large input files are forwarded. This is done using an xon/xoff protocol. 4. enables specification of stdin recipients on the mpirun cmd line. Allowed options include rank, "all", or "none". Default is rank 0. 5. creates a new MPI_Info key "ompi_stdin_target" that supports the above options for child jobs. Default is "none". 6. adds a new tool "orte-iof" that can connect to a running mpirun and display the output. Cmd line options allow selection of any combination of stdout, stderr, and stddiag. Default is stdout. 7. adds a new mpirun and orte-iof cmd line option "tag-output" that will tag each line of output with process name and stream ident. For example, "[1,0]<stdout>this is output" This is not intended for the 1.3 release as it is a major change requiring considerable soak time. This commit was SVN r19767.	2008-10-18 00:00:49 +00:00
Ralph Castain	15c47a2473	Revise the daemon collective system to handle comm_spawn patterns that cross into new nodes that are not direct children on the routing tree of the HNP. Refers to ticket #1548. Although this appears to fix the problem, the ticket will be held open pending further test prior to transition to the 1.3 branch. This commit was SVN r19674.	2008-10-02 20:08:27 +00:00
Ralph Castain	037231fbcb	Modify the node_rank and local_rank fields to be uint16_t so we can handle more than 256 procs/node. Change the type to a defined one so that any future change can be easily done, if required. This commit was SVN r19637.	2008-09-25 13:39:08 +00:00
Shiqing Fan	04ee20a880	- Mainly type casts. Microsoft VC++ compiler is too strict. This commit was SVN r19517.	2008-09-08 15:39:30 +00:00
Ralph Castain	4ef9d15d97	Revamp the opal mca paffinity interface. We ran into a problem when we encountered machines that had "holes" in their physical processor layout - e.g., machines that supported "hotplugging", or that had unpopulated sockets. To solve that problem, we had to clarify at the API level where we were describing physical vs logical processor info, and then translate accordingly in the underlying implementation. See opal/mca/paffinity/paffinity.h for explanation as to the physical vs logical nature of the params used in the API. Fixes trac:1435 This commit was SVN r19391. The following Trac tickets were found above: Ticket 1435 --> https://svn.open-mpi.org/trac/ompi/ticket/1435	2008-08-21 19:21:28 +00:00
Ralph Castain	913cf04633	Only co-locate debugger daemon if the orted has local children - prevents mpirun from co-locating a daemon when it has no local procs This commit was SVN r19280.	2008-08-13 20:06:28 +00:00
Ralph Castain	30f37f762d	Enable co-location of debugger daemons during initial launch and when debugging a running job. Provide support for four MPIR extensions that allow specification of debugger daemon executable, argv for the debugger daemon, whether or not to forward debugger daemon IO, and whether or not debugger daemon will piggy-back on ORTE OOB network. Last is not yet implemented. No change in behavior or operation occurs unless (a) the debugger specifically utilizes the extensions and, for co-locate while running, the user specifically enables the capability via an MCA param. Two of the MPIR extensions supported here are used in a widely-used debugger for a large-scale installation. The other two extensions are new and being utilized in prototype work by several debuggers for possible future release. This commit was SVN r19275.	2008-08-13 17:47:24 +00:00
Ralph Castain	0735d6f1c2	This commit fixes ticket #1414 Cleanup the logic in the odls for when processes terminate. It turns out that we were only going through the kill_proc logic once instead of looping over all local children when we ordered a daemon to kill its local procs. This went unnoticed for some time as for most systems the local procs were terminated anyway when the daemon terminated due to the parent/child relationship. Solaris is apparently different - the children are not automatically terminated when the parent dies. As a result, it acts as a detector for this bug. Mucho thanks to Rolf V. for his help in debugging - and to IM for letting me follow his gdb progress in quasi real-time! This commit was SVN r19044.	2008-07-26 02:54:43 +00:00
Ralph Castain	fdb2408bf2	Rename the osx paffinity component the "posix" component since it really has nothing osx specific in it - it is just a generic posix call to determine #processors. Set the priority low so that both linux and solaris components override it if they build. It shouldn't build in Windows at all. Modify the odls to remove a (size_t) typecast in front of the num_processors variable just in case it is returned negative. This usually is accompanied by an opal_error, so this shouldn't make any difference - but it is more technically correct. This commit was SVN r19008.	2008-07-24 01:54:51 +00:00
Lenny Verkhovsky	b4d54dda57	Fixed possible segfault when using RANKFILE, but not all ranks assigned. Fixed allocation of all ranks when using RANKFILE, but not all ranks assigned. Aborting if using RANKFILE, but np wasn't specified a little earlier. Clean mca_rmaps_rank_file_component.debug. This commit was SVN r19004.	2008-07-23 17:44:02 +00:00
Ralph Castain	e3c3d28bf1	Add some more debugging to tell us how many processors were found when setting sched_yield This commit was SVN r18999.	2008-07-23 15:28:51 +00:00
Ralph Castain	83e7c19d33	Remove deprecated function - this was incorporated into the paffinity framework a long time ago. Fortunately, nobody was actually using it! This commit was SVN r18990.	2008-07-23 03:43:31 +00:00
Ralph Castain	6135943382	Update the paffinity call in the ODLS so we retrieve the number of processors on the local node, thus allowing us to correctly set the sched_yield parameter. This commit was SVN r18946.	2008-07-18 19:19:16 +00:00
Lenny Verkhovsky	a812324963	Fixing "paffinity_base_slot_list" environment This commit was SVN r18900.	2008-07-14 07:10:50 +00:00
Jeff Squyres	583bf425c0	Fixes trac:1383: Short version: remove opal_paffinity_alone and restore mpi_paffinity_alone. ORTE makes various information available for the MPI layer to decide what it wants to do in terms of processor affinity. Details: * remove opal_paffinity_alone MCA param; restore mpi_paffinity_alone MCA param * move opal_paffinity_slot_list param registration to paffinity base * ompi_mpi_init() calls opal_paffinity_base_slot_list_set(); if that succeeds use that. If no slot list was set, see if mpi_paffinity_alone was set. If so, bind this process to its Node Local Rank (NLR). The NLR is the ORTE-maintained slot ID; if you COMM_SPAWN to a host in this ORTE universe that already has procs on it, the NLR for the new job will start at N (not 0). So this is slightly better than mpi_paffinity_alone in the v1.2 series. * If a slot list is specified and mpi_paffinity_alone is set, we display an error and abort. * Remove calls from rmaps/rank_file component to register and lookup opal_paffinity mca params. * Remove code in orte/odls that set affinities - instead, have them just pass a slot_list if it exists. * Cleanup the orte/odls code that determined oversubscribed/want_processor as these were just opposites of each other. This commit was SVN r18874. The following Trac tickets were found above: Ticket 1383 --> https://svn.open-mpi.org/trac/ompi/ticket/1383	2008-07-10 21:12:45 +00:00
Ralph Castain	ba5498cdc6	Repair the MPI-2 dynamic operations. This includes: 1. repair of the linear and direct routed modules 2. repair of the ompi/pubsub/orte module to correctly init routes to the ompi-server, and correctly handle failure to correctly parse the provided ompi-server URI 3. modification of orterun to accept both "file" and "FILE" for designating where the ompi-server URI is to be found - purely a convenience feature 4. resolution of a message ordering problem during the connect/accept handshake that allowed the "send-first" proc to attempt to send to the "recv-first" proc before the HNP had actually updated its routes. Let this be a further reminder to all - message ordering is NOT guaranteed in the OOB 5. Repair the ompi/dpm/orte module to correctly init routes during connect/accept. Reminder to all: messages sent to procs in another job family (i.e., started by a different mpirun) are ALWAYS routed through the respective HNPs. As per the comments in orte/routed, this is REQUIRED to maintain connect/accept (where only the root proc on each side is capable of init'ing the routes), allow communication between mpirun's using different routing modules, and to minimize connections on tools such as ompi-server. It is all taken care of "under the covers" by the OOB to ensure that a route back to the sender is maintained, even when the different mpirun's are using different routed modules. 6. corrections in the orte/odls to ensure proper identification of daemons participating in a dynamic launch 7. corrections in build/nidmap to support update of an existing nidmap during dynamic launch 8. corrected implementation of the update_arch function in the ESS, along with consolidation of a number of ESS operations into base functions for easier maintenance. The ability to support info from multiple jobs was added, although we don't currently do so - this will come later to support further fault recovery strategies 9. 
minor updates to several functions to remove unnecessary and/or no longer used variables and envar's, add some debugging output, etc. 10. addition of a new macro ORTE_PROC_IS_DAEMON that resolves to true if the provided proc is a daemon There is still more cleanup to be done for efficiency, but this at least works. Tested on single-node Mac, multi-node SLURM via odin. Tests included connect/accept, publish/lookup/unpublish, comm_spawn, comm_spawn_multiple, and singleton comm_spawn. Fixes ticket #1256 This commit was SVN r18804.	2008-07-03 17:53:37 +00:00
Ralph Castain	0fa9d88009	Set $PWD for the application proc to match the cwd. If the user specifies a working dir via -wdir, this ensures that the enviro variable matches what they get from getcwd. Note that any subsequent calls to chdir in the user's program will break that equivalence - we can only ensure it starts out matching! This commit was SVN r18709.	2008-06-23 18:25:41 +00:00
Ralph Castain	3b5e80fa61	Shift responsibility for preconnecting the oob to the orte routed framework, which is the only place that knows what needs to be done. Only the direct module will actually do anything - it uses the same algo as the original preconnect function. This commit was SVN r18677.	2008-06-19 13:48:26 +00:00
Ralph Castain	0532d799d6	Complete implementation of the --without-rte-support configure option. Working with Brian, this has been tested on RedStorm. Some minor changes to help facilitate debugger support so that both mpirun and yod can operate with it. Still to be completed. This commit was SVN r18664.	2008-06-18 03:15:56 +00:00
Ralph Castain	a87aa442e3	Remove last remaining reference to iof_flush - it was #if'd out anyway. The existing flush code appears to have several critical problems. Given the impending rework of the IOF subsystem, there is no point in trying to fix it here. This commit was SVN r18649.	2008-06-11 16:25:46 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Pak Lui	7f7777a538	Check for NULL in prefix_dir. This commit fixes trac:1337. This commit was SVN r18612. The following Trac tickets were found above: Ticket 1337 --> https://svn.open-mpi.org/trac/ompi/ticket/1337	2008-06-06 19:55:01 +00:00
Josh Hursey	1de50b523c	Fix some Coverity 'Event set_but_not_used' highlights. Thanks to Jeff for bringing them to my attention. This commit was SVN r18606.	2008-06-06 14:38:41 +00:00
George Bosilca	25ae9c12e6	Silence few warnings. This commit was SVN r18568.	2008-06-03 19:58:40 +00:00
Ralph Castain	c992e99035	Remove the tags from orte_output_open and the filtering operation from orte_output - this will be handled differently to improve the XML output interface This commit was SVN r18557.	2008-06-03 14:24:01 +00:00
Ralph Castain	f76240e7cc	Modify the nidmap utility to pass daemon vpids for nodes. In some mapping algo's, it is possible for nodes to be skipped. This results in daemon vpids that differ from the index of their respective node in the node array, causing the daemon to not recognize procs that it is supposed to launch. This commit was SVN r18528.	2008-05-28 18:38:47 +00:00
Ralph Castain	0b2b655de5	Initialize a variable so it can correctly be dealt with at shutdown - fixes trac:1312 This commit was SVN r18505. The following Trac tickets were found above: Ticket 1312 --> https://svn.open-mpi.org/trac/ompi/ticket/1312	2008-05-27 14:53:24 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already make the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in-replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 1. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not their opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). 
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. 
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Ralph Castain	ac5263613c	Fix stupid singletons yet again This commit was SVN r18408.	2008-05-07 20:26:31 +00:00
Ralph Castain	d97a4f880d	Shift the daemon collective operation to the ODLS framework. Ensure we track the collectives per job to avoid race conditions. Take advantage of the new capabilities of the routed framework to define aggregating trees for the daemon collective, and to track which daemons are participating to handle the case of sparse participation. Make it all work with comm_spawn in the case of all procs on previously occupied nodes, some new procs on new nodes, and mixtures of the two. Note: comm_spawn now works with both binomial and linear routed modules. There remains a problem of spawned procs not properly getting updated contact info for the parent proc when run in the direct routed mode...but that's for another day. This commit was SVN r18385.	2008-05-06 20:16:17 +00:00
Josh Hursey	9971bc9d95	Merge in the mca_base_select changes per RFC: http://www.open-mpi.org/community/lists/devel/2008/04/3779.php {{{ svn merge -r 18276:18380 https://svn.open-mpi.org/svn/ompi/tmp-public/jjh-mca-play . }}} Any components not in the trunk, but in one of the affected frameworks must be updated. Contact the list, look at the RFC, or look at the diff for how to do this. Sorry for the early commit of this, but I wanted to get it in today (per RFC) and didn't know if I would have a chance later today. This commit was SVN r18381.	2008-05-06 18:08:45 +00:00
Ralph Castain	40904dd152	Add a binomial routed module - for now, still completely wires up the daemons, but that will be changed later. Modify grpcomm xcast so it now uses the selected routed module - eliminates cross-wiring of xcast and routing paths. Suboptimal at the moment, but better implementation is on its way. Cleanup ignore properties on the new routed components. This commit was SVN r18377.	2008-05-05 22:32:25 +00:00
Ralph Castain	8e846bf7f2	Separate the gathering of collective data by jobid This commit was SVN r18357.	2008-05-02 12:00:08 +00:00
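
The termination rule described in the r19991 and r20064 messages (a proc counts as fully terminated only after both the waitpid callback has fired and the IOF has reported that all output pipes are closed) can be sketched roughly as below. This is an illustration only, not the ODLS code: `ChildState`, `on_waitpid`, `on_iof_complete`, and `check_complete` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ChildState:
    """Hypothetical per-process record; field names are illustrative."""
    waitpid_fired: bool = False   # set when the waitpid callback runs
    iof_complete: bool = False    # set when all output pipes are closed
    terminated: bool = False      # set only once both flags are true


def check_complete(child: ChildState) -> bool:
    """Declare the proc terminated only when both conditions hold."""
    if child.waitpid_fired and child.iof_complete:
        child.terminated = True
    return child.terminated


def on_waitpid(child: ChildState) -> bool:
    """The waitpid callback only records that it was called, then checks."""
    child.waitpid_fired = True
    return check_complete(child)


def on_iof_complete(child: ChildState) -> bool:
    """The IOF-complete notification records its flag, then checks."""
    child.iof_complete = True
    return check_complete(child)
```

Whichever event arrives first, the process is not reported complete until the second one arrives, which is the property the commit messages rely on to avoid losing final output.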

111 commits