openmpi

Author	SHA1	Message	Date
Shiqing Fan	ae41b5418b	Update the RAS and PLM components for Windows. These changes affect only Windows, not other platforms. This commit was SVN r17686.	2008-03-04 17:13:01 +00:00
Ralph Castain	ffa232687a	Fix xcast so it works in multi-node situations where the user specifies a particular mode to use (e.g., direct). This commit was SVN r17682.	2008-03-03 20:07:02 +00:00
Ralph Castain	841d0e5208	Cleanup an attribute warning - not sure which one to set or where it should go, so I'll leave that to someone more familiar with "attributes". Ensure some debugging is only enabled when have_debug is set. This commit was SVN r17681.	2008-03-03 16:06:47 +00:00
Rich Graham	d37db14901	get the shared memory collectives working again with the new version of orte. This commit was SVN r17672.	2008-02-29 22:28:57 +00:00
Ralph Castain	6450962d59	Add some debugging to the message event object. Cleanup some no-longer-used values This commit was SVN r17671.	2008-02-29 20:10:31 +00:00
Ralph Castain	a585923de1	Silence some minor compiler warnings This commit was SVN r17662.	2008-02-29 02:39:39 +00:00
Tim Prins	84b2099fe8	Remove the now-unused orte_value_array. As this is the last 'class' split between orte and ompi, remove the big comment about the split in ompi_bitmap. Also, update some properties (source files should not be executable...), and remove a couple unneeded inclusions of orte_proc_table.h. This commit was SVN r17655.	2008-02-28 21:39:42 +00:00
Ralph Castain	5e6928d710	Cleanup recursions in ORTE caused by processing recv'd messages that can cause the system to take action resulting in receipt of another message. Basically, the method employed here is to have a recv create a zero-time timer event that causes the event library to execute a function that processes the message once the recv returns. Thus, any action taken as a result of processing the message occurs outside of a recv. Created two new macros to assist: ORTE_MESSAGE_EVENT, which creates the zero-time event, passing info in a new orte_message_event_t object; and ORTE_PROGRESSED_WAIT, which, while waiting for specified conditions, just calls progress so messages can be recv'd. Also fixed the failed_launch function as we no longer block in the orted callback function. Updated the error messages to reflect revision. No change in API to this function, but PLM "owners" may want to check their internal error messages to avoid duplication and excessive output. This has been tested on Mac, TM, and SLURM. This commit was SVN r17647.	2008-02-28 19:58:32 +00:00
Ralph Castain	5dc64cea6a	Correct logic - only issue recv and cancel it if we are an HNP This commit was SVN r17641.	2008-02-28 15:27:16 +00:00
George Bosilca	9d421bea2a	Replace all occurrences of orte_pointer_array by opal_pointer_array. Remove the implementation of orte_pointer_array. This commit was SVN r17636.	2008-02-28 05:32:23 +00:00
Ralph Castain	d70e2e8c2b	Merge the ORTE devel branch into the main trunk. Details of what this means will be circulated separately. Remains to be tested to ensure everything came over cleanly, so please continue to withhold commits a little longer This commit was SVN r17632.	2008-02-28 01:57:57 +00:00
Gleb Natapov	da3e69101d	Add missing include. This commit was SVN r17493.	2008-02-18 14:55:02 +00:00
Galen Shipman	18d1d3b408	Add ORTE ALPS support (Cray XT CNL) This commit was SVN r17482.	2008-02-17 19:29:06 +00:00
George Bosilca	fcab6cc0bb	Fix typo. This commit was SVN r17255.	2008-01-26 21:36:04 +00:00
Rainer Keller	9d4852cdc1	- Get rid of Wshadow warnings. This commit was SVN r17231.	2008-01-25 14:07:38 +00:00
Pak Lui	413bcca4c0	Support the qrsh or qsub "-notify" option by catching the SIGUSR1/2 signals and not letting user processes exit on those signals. This commit was SVN r17174.	2008-01-22 17:32:29 +00:00
Josh Hursey	158dda5458	Fix some overlapping code. This commit was SVN r17067.	2008-01-08 15:40:21 +00:00
George Bosilca	eb71a634c6	Don't forget to initialize the msg_origin field. This commit was SVN r17055.	2008-01-04 23:24:49 +00:00
George Bosilca	48f5a26e8c	Cast to keep VC happy (quiet). This commit was SVN r17054.	2008-01-04 23:13:32 +00:00
Adrian Knoth	42d5fe62f9	Fixed misplaced #endif This commit was SVN r17028.	2008-01-01 11:02:38 +00:00
Jeff Squyres	213b5d5c6e	Per long threads on the mailing list and much confusing discussion about linkers, have all OPAL, ORTE, and OMPI components '''not''' link against the OPAL, ORTE, or OMPI libraries. See http://www.open-mpi.org/community/lists/users/2007/10/4220.php for details (or https://svn.open-mpi.org/trac/ompi/wiki/Linkers for a better-formatted version of the same info). This commit was SVN r16968.	2007-12-15 13:32:02 +00:00
Josh Hursey	f7812baf5b	forgot a bit of error checking in the last commit This commit was SVN r16953.	2007-12-13 14:41:18 +00:00
Josh Hursey	a287c9cb65	This commit distinguishes the file transfer stage from the finish stage. This commit also cleans up the checkpoint and terminate case making it more precise than before. Previously the application could make a small amount of progress between checkpoint completion and application termination. Now the application will make no progress at all in this time span. Additional minor change: - Start using OPAL_INT_TO_BOOL instead of if/else logic This commit was SVN r16952.	2007-12-13 14:37:17 +00:00
Rolf vandeVaart	3ea89b69ae	Remove a few tabs. Allow the output stream to be passed to the close command for verbose output. This matches all the other frameworks. This commit was SVN r16938.	2007-12-11 20:44:56 +00:00
Josh Hursey	27c9016b93	sleep -> usleep so we can be a bit more eager when waiting for events to finish. Still working on solutions that do not involve sleeping, but this will do for now. This commit was SVN r16824.	2007-12-03 19:27:32 +00:00
Jeff Squyres	c20350b943	Patch submitted by Brian Barrett, inspired by this thread: http://www.open-mpi.org/community/lists/users/2007/11/4547.php. - Better handling of ECONNABORTED from connect on Linux. - Reduce extraneous output from OOB when TCP connections must be retried. This commit was SVN r16808.	2007-11-30 21:42:15 +00:00
Ron Brightwell	edb9d8e354	Added Catamount to the conditional compilation since Catamount doesn't support fork() or pipe() either. This removes a linker warning message when building for Cray XT with Catamount. This commit was SVN r16772.	2007-11-21 21:37:58 +00:00
George Bosilca	d67c0eefb4	Remove a compilation warning about using uninitialized variables. This commit was SVN r16589.	2007-10-26 20:15:28 +00:00
George Bosilca	b1b5cb6453	Looks like SO_REUSEPORT is not defined on some platforms. Switch to the conventional SO_REUSEADDR instead. This commit was SVN r16588.	2007-10-26 19:56:21 +00:00
George Bosilca	337f78a4a8	Restrict the port range for the OOB and the BTL. Each protocol (v4 and v6) has its own range, which is defined by a min value and a range. By default there is no limitation on the port range, which is exactly the same behavior as before. This commit was SVN r16584.	2007-10-26 16:36:51 +00:00
Jeff Squyres	9e4387d021	* Use new BEGIN_C_DECLS / END_C_DECLS convention * Add newline at end of file to avoid compiler warning This commit was SVN r16579.	2007-10-26 13:40:38 +00:00
Shiqing Fan	3c38c9c020	- Add extern "C" to resolve linkage specification problems. This commit was SVN r16577.	2007-10-26 09:54:42 +00:00
Ralph Castain	a791ce2299	The processor affinity must be set on a per-process basis, not per-app-context. This commit was SVN r16559.	2007-10-23 20:46:16 +00:00
George Bosilca	7a63f9b730	I somehow messed up my last commit. Sorry. This commit was SVN r16543.	2007-10-22 15:08:17 +00:00
George Bosilca	b93f72bdfd	Remove 2 warnings about uninitialized i and quit_flags. This commit was SVN r16542.	2007-10-22 15:01:15 +00:00
Jeff Squyres	5637c7a5a0	In addition to r16513, this commit fixes trac:1170. If we cannot resolve the route to the peer that we're trying to send to, don't queue up the message in the TCP OOB -- instead, return it to the upper layer (e.g., the RML) and let it decide what to do. In the case of the routed RML, the tree component will queue it up for later transmission. Hence, we don't want the message queued up both here in the TCP OOB and the tree routed. Also see some more discussion / explanation in #1171. This commit was SVN r16540. The following SVN revision numbers were found above: r16513 --> open-mpi/ompi@7ae9589d70 The following Trac tickets were found above: Ticket 1170 --> https://svn.open-mpi.org/trac/ompi/ticket/1170	2007-10-22 13:46:57 +00:00
Jeff Squyres	7ae9589d70	The header is at the address of the buffer pointed to by the iov, not the address of the iov. This commit was SVN r16513.	2007-10-19 12:40:14 +00:00
Jeff Squyres	abf1b728b9	Minor code maintenance fix -- put the THREAD_UNLOCK outside the if statement so that you only have to have it once. This commit was SVN r16512.	2007-10-19 12:36:26 +00:00
Ralph Castain	73eeb7f0d2	Fix a bug in the way we handled buffer releases and the conditioned wait that held us in the xcast until completed. This commit was SVN r16504.	2007-10-19 01:17:01 +00:00
Josh Hursey	0bf61a1b84	Move in some accumulated small features and minor bug fixes for C/R support. {{{ svn merge -r 16447:16475 https://svn.open-mpi.org/svn/ompi/tmp/jjh-fgs . }}} This commit was SVN r16478.	2007-10-17 13:47:36 +00:00
Ralph Castain	ec5fe78876	When in the unity message routing mode, we have to update the RML contact info in the parent procs so that they know how to talk to the children. Ideally, this would be done in the MPI layer since that layer knows which procs are actively involved in the comm_spawn. However, it isn't being done there, which causes comm_spawn to fail, so do it explicitly in the RTE. Note that this means ALL procs in the parent job are updated, even though they may not be participating in the comm_spawn. This doesn't really hurt anything - just unnecessary. Comm_spawn still has a problem when a child process shares a node with a parent, so this doesn't fix everything. It only fixes the bug of ensuring all procs know how to talk to each other. This commit was SVN r16460.	2007-10-16 16:09:41 +00:00
Ralph Castain	713b6e13a5	Improve diagnostic output messages when errors are hit This commit was SVN r16457.	2007-10-16 14:51:52 +00:00
Josh Hursey	ea0652d20f	If we are going to pretend to do filem, then we should always pretend. No one should be using this feature except for me. :) This commit was SVN r16454.	2007-10-15 20:04:35 +00:00
Ralph Castain	b6196e8a39	When we can detect that a daemon has failed, then we would like to terminate the system without having it lock up. The "hang" is currently caused by the system attempting to send messages to the daemons (specifically, ordering them to kill their local procs and then terminate). Unfortunately, without some idea of which daemon has died, the system hangs while attempting to send a message to someone who is no longer alive. This commit introduces the necessary logic to avoid that conflict. If a PLS component can identify that a daemon has failed, then we will set a flag indicating that fact. The xcast system will subsequently check that flag and, if it is set, will send all messages direct to the recipient. In the case of "kill local procs" and "terminate", the messages will go directly to each orted, thus bypassing any orted that has failed. In addition, the xcast system will -not- wait for the messages to complete, but will return immediately (i.e., operate in non-blocking mode). Orterun will wait (via an event timer) for a period of time based on the number of daemons in the system to allow the messages to attempt to be delivered - at the end of that time, orterun will simply exit, alerting the user to the problem and -strongly- recommending they run orte-clean. I could only test this on slurm for the case where all daemons unexpectedly died - srun apparently only executes its waitpid callback when all launched functions terminate. I have asked that Jeff integrate this capability into the OOB as he is working on it so that we execute it whenever a socket to an orted is unexpectedly closed. Meantime, the functionality will rarely get called, but at least the logic is available for anyone whose environment can support it. This commit was SVN r16451.	2007-10-15 18:00:30 +00:00
Jeff Squyres	423f23eb6a	Fixes trac:1160. There is still some other problem in the OOB, but we wanted to commit this to get wider testing. This commit was SVN r16445. The following Trac tickets were found above: Ticket 1160 --> https://svn.open-mpi.org/trac/ompi/ticket/1160	2007-10-15 15:41:36 +00:00
Josh Hursey	f16a42947a	Change some default MCA parameters: - Global snapshot directory = $HOME - FileM 'rsh' = 'ssh' - FileM 'rcp' = 'scp' This commit was SVN r16444.	2007-10-15 15:21:17 +00:00
Josh Hursey	520c27ac94	If the HNP is acting as the orted for local launch then the gpr_replica variable is not defined. Make sure to set it to something reasonable so that file preloading still works (instead of seg faulting :) Thanks to Hiep Bui Hoang for reporting this bug. This commit was SVN r16433.	2007-10-11 19:47:04 +00:00
Josh Hursey	e483c36cea	Remove a bit of debug in filem/rsh that should have never been committed. A gesture towards overlapping file removal with metadata update. This commit was SVN r16432.	2007-10-11 19:37:33 +00:00
Ralph Castain	3dbd4d9be7	Squeeeeeeze the launch message. This is the message sent to the daemons that provides all the data required for launching their local procs. In reorganizing the ODLS framework, I discovered that we were sending a significant amount of unnecessary and repeated data. This commit resolves this by: 1. taking advantage of the fact that we no longer create the launch message via a GPR trigger. In earlier times, we had the GPR create the launch message based on a subscription. In that mode of operation, we could not guarantee the order in which the data was stored in the message - hence, we had no choice but to parse the message in a loop that checked each value against a list of possible "keys" until the corresponding value was found. Now, however, we construct the message "by hand", so we know precisely what data is in each location in the message. Thus, we no longer need to send the character string "keys" for each data value any more. This represents a rather large savings in the message size - to give you an example, we typically would use a 30-char "key" for a 2-byte data value. As you can see, the overhead can become very large. 2. sending node-specific data only once. Again, because we used to construct the message via subscriptions that were done on a per-proc basis, the data for each node (e.g., the daemon's name, whether or not the node was oversubscribed) would be included in the data for each proc. Thus, the node-specific data was repeated for every proc. Now that we construct the message "by hand", there is no reason to do this any more. Instead, we can insert the data for a specific node only once, and then provide the per-proc data for that node. We therefore not only save all that extra data in the message, but we also only need to parse the per-node data once. The savings become significant at scale. 
Here is a comparison between the revised trunk and the trunk prior to this commit (all data was taken on odin, using openib, 64 nodes, unity message routing, tested with application consisting of mpi_init/mpi_barrier/mpi_finalize, all execution times given in seconds, all launch message sizes in bytes):

Per-node scaling, taken at 1ppn:

#nodes    original trunk     revised trunk
          time     size      time     size
1         0.10      819      0.09      564
2         0.14     1070      0.14      677
3         0.15     1321      0.14      790
4         0.15     1572      0.15      903
8         0.17     2576      0.20     1355
16        0.25     4584      0.21     2259
32        0.28     8600      0.27     4067
64        0.50    16632      0.39     7683

Per-proc scaling, taken at 64 nodes:

ppn       original trunk     revised trunk
          time     size      time     size
1         0.50    16669      0.40     7720
2         0.55    32733      0.54    11048
3         0.87    48797      0.81    14376
4         1.0     64861      0.85    17704

Condensing those numbers, it appears we gained:

per-node message size: 251 bytes/node -> 113 bytes/node
per-proc message size: 251 bytes/proc -> 52 bytes/proc
per-job message size: 568 bytes/job -> 399 bytes/job (job-specific data such as jobid, override oversubscribe flag, total #procs in job, total slots allocated)

The fact that the two pre-commit trunk numbers are the same confirms the fact that each proc was containing the node data as well. It isn't quite the 10x message reduction I had hoped to get, but it is significant and gives much better scaling. Note that the timing info was, as usual, pretty chaotic - the numbers cited here were typical across several runs taken after the initial one to avoid NFS file positioning influences. Also note that this commit removes the orte_process_info.vpid_start field and the handful of places that passed that useless value. By definition, all jobs start at vpid=0, so all we were doing is passing "0" around. In fact, many places simply hardwired it to "0" anyway rather than deal with it. This commit was SVN r16428.	2007-10-11 15:57:26 +00:00
Rolf vandeVaart	25c95c9ee9	Fix build on solaris. Need to include sys/wait.h. This commit was SVN r16426.	2007-10-11 15:04:30 +00:00
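
The condensed byte counts quoted in the r16428 message (251 -> 113 bytes/node, 251 -> 52 bytes/proc, 568 -> 399 bytes/job) can be cross-checked against the sizes reported in that same message. The following is a minimal sketch, not part of the Open MPI source: the names `per_node`, `per_proc`, `slope`, and `condense` are hypothetical, and the per-job formula is an assumed reading of how those figures were derived (in the original trunk each proc entry carried the node data, so only one per-unit cost is subtracted).

```python
# Launch-message sizes in bytes, copied from the r16428 commit message.
# Per-node scaling at 1 process per node: {nodes: (original, revised)}
per_node = {1: (819, 564), 64: (16632, 7683)}
# Per-proc scaling at 64 nodes: {ppn: (original, revised)}
per_proc = {1: (16669, 7720), 4: (64861, 17704)}
NODES = 64  # node count used for the per-proc measurements


def slope(table, column):
    """Bytes added per unit step between the smallest and largest key."""
    lo, hi = min(table), max(table)
    return (table[hi][column] - table[lo][column]) / (hi - lo)


def condense(column):
    """Return (bytes per node, bytes per proc) for one trunk version."""
    bytes_per_node = slope(per_node, column)
    # Adding one proc per node adds NODES procs to the job in total.
    bytes_per_proc = slope(per_proc, column) / NODES
    return round(bytes_per_node), round(bytes_per_proc)


original = condense(0)
revised = condense(1)

# Assumed derivation of the per-job overhead from the 1-node, 1-proc sizes.
job_original = per_node[1][0] - original[0]
job_revised = per_node[1][1] - revised[0] - revised[1]

print("original:", original, "per-job:", job_original)
print("revised: ", revised, "per-job:", job_revised)
```

Under these assumptions the first and last rows of each table are enough to recover the quoted per-node and per-proc figures; the intermediate rows are not used here.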


1134 commits