openmpi

Author	SHA1	Message	Date
Ralph Castain	4fff584a68	Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that did start. The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system. Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed. Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief. With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn. Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put". This commit was SVN r14711.	2007-05-21 18:31:28 +00:00
Ralph Castain	180c96bb8f	Clear an erroneous error message pending a more complete fix This commit was SVN r14698.	2007-05-18 14:44:27 +00:00
Tim Prins	8e7765e456	Fix a gigantic memory leak. We were copying a message to send into a buffer, then never freeing the copy we made. But we were mistakenly allocating the buffer on the stack, so the memory checking tools never caught the leak. On 96 nodes, 384 processes, mpirun memory usage went from about 12M to 3M for me after this minor change... This commit was SVN r14257.	2007-04-07 02:25:48 +00:00
Josh Hursey	dadca7da88	Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). This merge adds Checkpoint/Restart support to Open MPI. The initial frameworks and components support a LAM/MPI-like implementation. This commit follows the risk assessment presented to the Open MPI core development group on Feb. 22, 2007. This commit closes trac:158 More details to follow. This commit was SVN r14051. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r13912 The following Trac tickets were found above: Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158	2007-03-16 23:11:45 +00:00
George Bosilca	3fd278c522	Make the tree compile in debug mode. This commit was SVN r12724.	2006-12-01 23:03:09 +00:00
Ralph Castain	897744cdeb	Two major changes to the runtime: 1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1". 2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomial tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial". 3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1". 4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer) This commit was SVN r12722.	2006-12-01 22:30:39 +00:00
Ralph Castain	d0eb7d7216	Complete the attribute management functions. Modify the mapper to better bookmark its stopping place each time, and to pick up the next time from there. This needs to be validated on a multi-node system. Fix a major memory corruption problem in the registry put/get functions that was doing multiple free's. Not sure how valgrind missed this one, though it only occurred in specific circumstances (such as comm_spawn). This commit was SVN r12179.	2006-10-18 20:02:16 +00:00
Ralph Castain	5dfd54c778	With the branch to 1.2 made.... Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced). Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up). I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t). In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but... Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems. This commit was SVN r11204.	2006-08-15 19:54:10 +00:00
Brian Barrett	566a050c23	Next step in the project split, mainly source code re-arranging - move files out of toplevel include/ and etc/, moving it into the sub-projects - rather than including config headers with <project>/include, have them as <project> - require all headers to be included with a project prefix, with the exception of the config headers ({opal,orte,ompi}_config.h mpi.h, and mpif.h) This commit was SVN r8985.	2006-02-12 01:33:29 +00:00
Ralph Castain	4b9f015c0b	Merge in the new data support subsystem for ORTE. MPI folks should not notice a difference. Longer explanation will be sent to developers mailing list. This commit was SVN r8912.	2006-02-07 03:32:36 +00:00
Tim Woodall	7f20198d49	Filter the set of data returned to the daemons during startup using the new get_conditional command to improve scalability during launch This commit was SVN r8097.	2005-11-10 16:44:51 +00:00
Jeff Squyres	42ec26e640	Update the copyright notices for IU and UTK. This commit was SVN r7999.	2005-11-05 19:57:48 +00:00
Jeff Squyres	1b691f8089	Pull NULL checks around releasing of resources to ensure we don't segv. This commit was SVN r7971.	2005-11-03 11:27:19 +00:00
Jeff Squyres	60b0330bc1	Initialize "conditions" to ensure we don't segv This commit was SVN r7961.	2005-11-01 17:13:18 +00:00
Ralph Castain	399e41d113	Fix a potential memory leak... This commit was SVN r7960.	2005-11-01 15:17:11 +00:00
Jeff Squyres	a2e507c629	Fix potential segv through uninitialized variable This commit was SVN r7946.	2005-11-01 13:09:00 +00:00
Ralph Castain	afeeacd76d	Complete hookup of the registry proxy for the get_conditional command. This commit was SVN r7915.	2005-10-28 05:35:07 +00:00
Ralph Castain	eebda71a0b	Add a new API to the registry for conditional data retrievals. The new API allows you to retrieve data from registry containers that have key-value pairs where the value matches the specified one. The requested keys are then retrieved from that container. This commit was SVN r7907.	2005-10-28 00:30:58 +00:00
Tim Woodall	c0124fecdd	changed segment dictionary to hash table to improve search time for reverse lookup This commit was SVN r7893.	2005-10-27 17:00:47 +00:00
Tim Woodall	88c7fd9f8d	add support for a "persistent" non-blocking receive that doesn't require a re-registration on every receive This commit was SVN r7822.	2005-10-20 22:06:11 +00:00
Josh Hursey	d39841174d	Must release the lock before entering the non blocking recv, since it is possible that if the message has already arrived the callback will be called before recv_buffer_nb() returns. This causes deadlock as we try to acquire the lock, but already hold it. This was causing orterun and orteds to stall in certain situations. Became evident when stress testing dynamics with remote nodes. This commit was SVN r7543.	2005-09-29 14:24:11 +00:00
Ralph Castain	b589a93e29	Continue to lace the trace functionality into orte... This commit was SVN r7427.	2005-09-19 15:29:14 +00:00
Josh Hursey	575afef072	Use non blocking sends in orte_gpr_replica_remote_notify. This fixes one of the race conditions when orterun is sent a kill signal. Before it would sometimes spin in the OOB waiting for a message to complete to a peer that was no longer around. Stalling at this level prevented orterun from noticing that it had received a kill signal. This commit was SVN r7408.	2005-09-16 15:34:44 +00:00
Jeff Squyres	f4e8fe4817	Arrgh -- stupid mistake on last commit -- accidentally replaced a LIBADD instead of appending to the existing one. Also removed some more Makefile.options whitespace, and I think emacs removed some tabs (i.e., replaced them with whitespace). This commit was SVN r7399.	2005-09-15 21:37:24 +00:00
Brian Barrett	ed56e743b7	* update configure.ac to use the modern version of AC_INIT and AM_INIT_AUTOMAKE, instead of the deprecated version. * Work around dumbness in modern AC_INIT that requires the version number to be set at autoconf time (instead of at configure time, as it was before). Set the version number, minus the subversion r number, at autoconf time. Override the internal variables to include the r number (if needed) at configure time. Basically, the right thing should always happen. The only place it might not is the version reported as part of configure --help will not have an r number. * Since AM_INIT_AUTOMAKE takes a list of options, no need to specify them in all the Makefile.am files. * Adds support for subdir-objects, meaning that object files are put in the directory containing source files, even if the Makefile.am is in another directory. This should start making it feasible to reduce the number of Makefile.am files we have in the tree, which will greatly reduce the time to run autogen and configure. This commit was SVN r7211.	2005-09-07 05:54:53 +00:00
Ralph Castain	f352890732	Cleaning up memory leaks for proxy operations. This commit was SVN r7157.	2005-09-02 19:26:21 +00:00
Ralph Castain	96f4bb7a63	Hey, sports fans!! Guess what?? Here's the huge registry check-in you've all been waiting for with bated breath. The revised version sends a single message to all processes at the various stage gates, thus making the startup much more scalable. I could provide you with all the tawdry details, but won't for now - you are welcome to ask, though, and I'll merrily bore your ears to tears. In addition, the commit contains the following: 1. set the ignore properties on ompi/debuggers and orte/mca/pls/poe 2. Added simplified subscribe and put functions to the registry's API. I have also converted all of the ompi functions that registered subscriptions to the new API, and caught their associated put's as well. In a follow-on commit, I'll be adding support for George's hetero arch registry subscription (wanted to get this one in first). This commit was SVN r7118.	2005-09-01 01:07:30 +00:00
Ralph Castain	4e1837687b	Finish simplified interfaces for put and subscribe - more details to come. This commit was SVN r6713.	2005-08-02 19:43:29 +00:00
Ralph Castain	8c6c78c47a	Add a few new functions that were requested last week - not tested yet, so please don't use them! I will test them this afternoon on a different computer. For now, they won't cause any problems since they aren't being called. This commit was SVN r6689.	2005-08-01 16:38:15 +00:00
Ralph Castain	f604fb72db	Turn "on" the delete functionality for the registry. Should now be able to delete entries and segments, and get an index of the dictionary entries on the registry. Haven't fully tested these yet (nobody is using them at the moment that I know of - good thing, since they haven't been working for a long time - though I know the MPI-2 stuff needs the functionality), but will do so shortly. For now, they compile. This commit was SVN r6567.	2005-07-20 18:07:46 +00:00
Ralph Castain	19d58ee17e	First phase of the scalable RTE changes: 1. Modify the registry to eliminate redundant data copying for startup messages. 2. Revise the subscription/trigger system to avoid redundant storage of triggers and subscriptions. This dramatically reduces the search time when a registry action occurs - to illustrate the point, there are now only a handful of triggers on the system for each job. Before, there were a handful of triggers for each PROCESS in the job, all of which had to be checked every time something happened on the registry. This is much, much faster now. 3. Update all subscriptions to the new format. There are now "named" subscriptions - this allows you to "name" a subscription that all the processes will be using. The first one to hit the registry actually defines the subscription. From then on, any subsequent "subscribes" to the same name just cause that process to "attach" to the existing subscription. This keeps the number of subscriptions being tracked by the registry to a minimum, while ensuring that each process still gets notified. 4. Do the same for triggers. Also fixed a duplicate subscription problem that was causing people to receive data equal to the number of processes times the data they should have received from a trigger/subscription. Sorry about that... :-( ...but it's all better now! Uncovered a situation where the modex data seems to be getting entered on the registry a second time - the latter time coming after the compound command has been "fired", thereby causing all the subscriptions to fire. Asked Tim and Jeff to look into this. Second phase of the changes will involve modifying the xcast system so that the same message gets sent to all processes. This will further reduce the message traffic, and - once we have a true "broadcast" version of xcast - really speed things up and improve scalability. This commit was SVN r6542.	2005-07-18 18:49:00 +00:00
Ralph Castain	44ace2f64e	Well, I think this will fix the bug Greg encountered when sending no triggers on a subscription. However, I can't test it since the trunk no longer runs on my Mac notebook - I get an error message "No ptl components available. This shouldn't happen." and the processes exit. This commit was SVN r6476.	2005-07-14 01:32:36 +00:00
Ralph Castain	81af57707f	Don't release the message buffer - the messaging function takes care of it. This commit was SVN r6437.	2005-07-12 15:41:45 +00:00
Brian Barrett	a13166b500	* rename ompi_output to opal_output This commit was SVN r6329.	2005-07-03 23:31:27 +00:00
Brian Barrett	39dbeeedfb	* rename locking code from ompi to opal This commit was SVN r6327.	2005-07-03 22:45:48 +00:00
Jeff Squyres	1b18979f79	Initial population of orte tree This commit was SVN r6266.	2005-07-02 13:42:54 +00:00
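
Note on the "arith" operation described in r14711: the commit message says it changes a registry value atomically and is equivalent to a "get", a modification, and a "put". A minimal sketch of that idea follows, in Python rather than the project's C. `ToyRegistry`, its method names, and the use of integer division for "div" are assumptions made for illustration only; none of this is the actual GPR interface.

```python
import operator
import threading


class ToyRegistry:
    """Hypothetical in-memory key-value store; NOT the ORTE GPR API."""

    # Assumed mapping of operation names to functions; "div" uses
    # integer division here purely for the sake of the example.
    _OPS = {
        "add": operator.add,
        "sub": operator.sub,
        "mul": operator.mul,
        "div": operator.floordiv,
    }

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def put(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key):
        with self._lock:
            return self._data[key]

    def arith(self, key, op, operand):
        # The get, the modification, and the put all happen under one
        # lock acquisition, so no other thread can update the value
        # between the read and the write.
        with self._lock:
            self._data[key] = self._OPS[op](self._data[key], operand)
            return self._data[key]
```

With separate `get` and `put` calls, two callers could read the same old value and one update would be lost; folding the three steps into a single locked call is what avoids that.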

36 commits
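
Note on the deadlock fixed in r7543: the commit message explains that a non-blocking receive may invoke its callback before it returns if the message has already arrived, so the caller must not hold a lock that the callback also needs. The sketch below models that ordering in Python. `ToyTransport`, `Client`, and all method names are hypothetical stand-ins, not the OOB or `recv_buffer_nb()` itself.

```python
import threading


class ToyTransport:
    """Hypothetical stand-in for a messaging layer."""

    def __init__(self):
        self.queue = []

    def recv_nb(self, callback):
        # If a message is already queued, the callback runs right here,
        # before recv_nb returns to its caller.
        if self.queue:
            callback(self.queue.pop(0))
            return True
        return False


class Client:
    def __init__(self, transport):
        self.transport = transport
        self.lock = threading.Lock()  # non-reentrant
        self.received = []
        self.posted = 0

    def _on_message(self, msg):
        # The callback needs the same lock as post_receive.
        with self.lock:
            self.received.append(msg)

    def post_receive(self):
        self.lock.acquire()
        self.posted += 1  # bookkeeping done while holding the lock
        # Release BEFORE posting the receive. If the lock were still
        # held and a message were already queued, _on_message would
        # block forever trying to acquire it.
        self.lock.release()
        self.transport.recv_nb(self._on_message)
```

Moving `self.lock.release()` below the `recv_nb` call reproduces the hang whenever the queue is non-empty, which matches the stall the commit describes.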