openmpi

Author	SHA1	Message	Date
Brian Barrett	5ec421e1b0	Create a new queue (to simplify locking) for requests that are started but can not be started by the BTL. This commit was SVN r14757.	2007-05-24 17:21:56 +00:00
George Bosilca	bd5be6ed79	Decrease the dependencies on the rest of the Open MPI code base. This commit was SVN r14756.	2007-05-24 16:59:00 +00:00
George Bosilca	7459ab45f1	This is the complete commit for the TCP header issue. Jeff committed a partial fix (r14749) and then backed it out (r14753). As we are unable to send more than a 32-bit length over TCP in one go, there is no reason to have a uint64 length in the header. This reduces the size of the TCP header. This commit was SVN r14755. The following SVN revision numbers were found above: r14749 --> open-mpi/ompi@48c026ce6b r14753 --> open-mpi/ompi@28ed850b4c	2007-05-24 16:40:49 +00:00
George Bosilca	f744e09462	The hopefully final correction for ticket #919. Make sure we are always aligned to the max width (MPI_Aint) when we pack the description of a data-type. This commit was SVN r14754.	2007-05-24 16:08:23 +00:00
Jeff Squyres	28ed850b4c	Back out r14749; it wasn't quite ready for prime time yet... This commit was SVN r14753. The following SVN revision numbers were found above: r14749 --> open-mpi/ompi@48c026ce6b	2007-05-24 15:46:15 +00:00
Brian Barrett	1b025798d2	remove some now unneeded volatiles This commit was SVN r14752.	2007-05-24 15:42:06 +00:00
Brian Barrett	1a9f48c89d	Some much needed cleanup of the rdma one-sided component, similar to r14703 for the point-to-point component. * Associate the list of long message requests to poll with the component, not the individual modules * add progress thread that sits on the OMPI request structure and wakes up at the appropriate time to poll the message list to move long messages asynchronously. * Instead of calling opal_progress() all over the place, move to using the condition variables like the rest of the project. Has the advantage of moving it slightly further along in the becoming thread safe thing. * Fix a problem with the passive side of unlock where it could go recursive and cause all kinds of problems, especially when progress threads are used. Instead, have two parts of passive unlock -- one to start the unlock, and another to complete the lock and send the ack back. The data moving code trips the second at the right time. This commit was SVN r14751. The following SVN revision numbers were found above: r14703 --> open-mpi/ompi@2b4b754925	2007-05-24 15:41:24 +00:00
Jeff Squyres	48c026ce6b	Commit a patch from George (reviewed by Brian): reduce the size of the mca_btl_tcp_hdr_t struct and remove the need for the heterogeneous padding by changing the type of the "size" member to be uint32_t (vs. uint64_t). The value would never be greater than 32 bits anyway, so having the type be uint64_t was wasteful. This commit was SVN r14749.	2007-05-24 15:08:57 +00:00
Jeff Squyres	379a4ec5e2	While we're editing MCA params in the oob tcp component, ditch the use of the deprecated MCA param API for registering MCA parameters and update to the current API. This commit was SVN r14747.	2007-05-24 13:01:55 +00:00
Jeff Squyres	839c1db95c	Fix something that has been bugging me for a while: Rename the oob_tcp_include and oob_tcp_exclude MCA parameters to be oob_tcp_if_include and oob_tcp_if_exclude (to match the convention with btl_tcp_if_[in|ex]clude). Keep "hidden" synonyms oob_tcp_include and oob_tcp_exclude in case anyone is actually using them (and some users undoubtedly are), but do not have them show up in ompi_info --param output. Instead, the new "oob_tcp_if_*" names will show up in ompi_info output. This commit was SVN r14746.	2007-05-24 12:52:26 +00:00
Gleb Natapov	be71b78f6a	Initialize btl_send_limit before use. This commit was SVN r14745.	2007-05-24 08:40:26 +00:00
Josh Hursey	caea631032	Fix a compiler error when not building with ipv6 support. This commit was SVN r14743.	2007-05-23 21:53:00 +00:00
Brian Barrett	5f15becf4e	Allow multiple connections to be started simultaneously when doing the OOB wireup. For small clusters or clusters with decent ARP lookup and connect times, this will have marginal impact. For systems with either bad ARP lookup times or long connect times, increasing this number to something much closer to SOMAXCONN (128 on most modern machines) will result in a faster OOB wireup. Don't set higher than SOMAXCONN or you can end up with lots of connect() retries and we'll end up slower. This commit was SVN r14742.	2007-05-23 21:35:44 +00:00
Brian Barrett	075389f67d	fix some printf warnings This commit was SVN r14740.	2007-05-23 21:19:26 +00:00
Brian Barrett	38b0d22243	Some cleanups to the pt2pt component * Remove unused declaration * remove unused variable warning when not using progress threads * If we're using progress threads, we want to lock, not trylock when in progress, since it was called from the wakeup thread and not the progress function This commit was SVN r14739.	2007-05-23 20:31:25 +00:00
Jeff Squyres	28a102feab	Add .libs to svn:ignore This commit was SVN r14738.	2007-05-23 20:07:02 +00:00
George Bosilca	7485c920c4	Remove the useless comment as it does not reflect the reality anymore. This commit was SVN r14737.	2007-05-23 18:55:02 +00:00
Ralph Castain	02f6e6ab3e	Slight touchup to make it pretty This commit was SVN r14734.	2007-05-23 16:39:18 +00:00
Ralph Castain	3fc227286f	Be sure to NULL terminate the list of keys... This commit was SVN r14733.	2007-05-23 16:35:03 +00:00
Ralph Castain	cffea274b3	Fix the hostfile system. Enable the hostfile RDS to properly setup the name service. Modify the name service so that a caller can provide a valid cellid and update its site info for later retrieval. This commit was SVN r14732.	2007-05-23 16:31:44 +00:00
Sven Stork	ed72cbcaec	- Fix a deadlock problem for threaded builds. We have to release the lock before we wait for the callback, because the callback will try to acquire the lock again (shows up in debug+threaded builds). This commit was SVN r14731.	2007-05-23 16:11:50 +00:00
Sven Stork	fc932f1fb4	- changes to get the tests running with visibility enabled This commit was SVN r14730.	2007-05-23 15:02:36 +00:00
Josh Hursey	a010ff6e6a	Some updates from the interface change to orte_init This commit was SVN r14729.	2007-05-23 14:44:23 +00:00
Ralph Castain	e6ff7757ab	Modify the new DSS xfer and copy functions so they only xfer/copy the unpacked portion of a buffer's payload. This allows for more rapid transfer of data during message relay without requiring any knowledge of what is in the buffer. Begin work on restoring binomial message distribution method. This commit was SVN r14728.	2007-05-23 14:06:32 +00:00
George Bosilca	b2e805db61	Nothing relevant. Indentation, typos, change PTL to BTL. This commit was SVN r14727.	2007-05-23 14:03:52 +00:00
George Bosilca	50b26ebb6a	Allow the ompi_ddt_init and ompi_ddt_finalize to be visible even when the visibility feature is on. This commit was SVN r14726.	2007-05-23 14:02:08 +00:00
Sven Stork	88f0845c44	- let the pt2pt component compile with threads enabled This commit was SVN r14725.	2007-05-23 12:56:34 +00:00
Adrian Knoth	b2219710d4	Cosmetics. Changed function name in error output. This commit was SVN r14724.	2007-05-23 10:05:18 +00:00
Adrian Knoth	52f3540a9c	As of r14652, use autoconf 2.60 and automake 1.10 This commit was SVN r14723. The following SVN revision numbers were found above: r14652 --> open-mpi/ompi@21e00f6f0c	2007-05-23 10:02:43 +00:00
Ralph Castain	b582d98d4a	Fix singleton comm_spawn so it can see available resources This commit was SVN r14719.	2007-05-22 13:29:07 +00:00
Ralph Castain	5b0abf520b	Don't update our own contact info This commit was SVN r14718.	2007-05-22 13:28:23 +00:00
Brian Barrett	38eab3613b	* Fix race condition with the pending_{in,out} variables -- if we're going to do while(...) { } then we can't change the variables in the ... atomically, but should do it while holding the module lock. * Fix dumb communicator creation error when we don't create the progress stuff (because a window already exists), where we would accidentally jump to the error case. This commit was SVN r14715.	2007-05-21 20:53:02 +00:00
Ralph Castain	677eb5e4bc	Ensure the singleton wakes up when comm_spawn fails This commit was SVN r14714.	2007-05-21 20:13:31 +00:00
Ralph Castain	b771cfcce3	Fix compile problem This commit was SVN r14713.	2007-05-21 20:11:03 +00:00
Ralph Castain	4fff584a68	Commit the orted-failed-to-start code. This correctly causes the system to detect the failure of an orted to start and allows the system to terminate all procs/orteds that did start. The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system. Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed. Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief. With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn. Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. 
This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put". This commit was SVN r14711.	2007-05-21 18:31:28 +00:00
Ralph Castain	c9c2922706	Drat - commit missing file. This commit was SVN r14710.	2007-05-21 17:20:56 +00:00
Brian Barrett	0e9e0c518a	Fix a couple more progress thread related issues... This commit was SVN r14708.	2007-05-21 16:06:14 +00:00
Ralph Castain	3288ce0462	Cleanup the SDS components and move some common code to the base. Modify the seed and singleton SDS code so it returns the right local rank and num_local_procs. This commit was SVN r14707.	2007-05-21 14:57:58 +00:00
Ralph Castain	d9acc93efa	Compute and pass the local_rank and local number of procs (in that proc's job) on the node. To be precise, given this hypothetical launching pattern: host1: vpids 0, 2, 4, 6 host2: vpids 1, 3, 5, 7 The local_rank for these procs would be: host1: vpids 0->local_rank 0, v2->lr1, v4->lr2, v6->lr3 host2: vpids 1->local_rank 0, v3->lr1, v5->lr2, v7->lr3 and the number of local procs on each node would be four. If vpid=0 then does a comm_spawn of one process on host1, the values of the parent job would remain unchanged. The local_rank of the child process would be 0 and its num_local_procs would be 1 since it is in a separate jobid. I have verified this functionality for the rsh case - need to verify that slurm and other cases also get the right values. Some consolidation of common code is probably going to occur in the SDS components to make this simpler and more maintainable in the future. This commit was SVN r14706.	2007-05-21 14:30:10 +00:00
Pavel Shamis	5ceaa605d7	Adding new vendor_part_id for Mellanox Hermon HCA This commit was SVN r14705.	2007-05-21 13:33:54 +00:00
Brian Barrett	1191677b76	Fix dumb threads-related compile issues This commit was SVN r14704.	2007-05-21 03:23:58 +00:00
Brian Barrett	2b4b754925	Some much needed cleanup of the point-to-point one-sided component... * Combine polling of the long requests and buffer requests into one type, and in one place * Associate the list of requests to poll with the component, not the individual modules * add progress thread that sits on the OMPI request structure and wakes up at the appropriate time to poll the message list. Not the best, but without some asynch notification from the PML that a given set of requests has completed, there isn't much better * Instead of calling opal_progress() all over the place, move to using the condition variables like the rest of the project. Has the advantage of moving it slightly further along in the becoming thread safe thing * Fix a problem with the passive side of unlock where it could go recursive and cause all kinds of problems, especially when progress threads are used. Instead, have two parts of passive unlock -- one to start the unlock, and another to complete the lock and send the ack back. The data moving code trips the second at the right time. This commit was SVN r14703.	2007-05-21 02:21:25 +00:00
Brian Barrett	77a80b2612	sockaddr_storage check needs sys/socket.h This commit was SVN r14702.	2007-05-20 21:54:43 +00:00
Jeff Squyres	97248d6bc6	Add another test to check multiple, concurrent COMM_SPAWN's. This commit was SVN r14701.	2007-05-19 19:02:24 +00:00
Jeff Squyres	47ba3db3b8	Add a simple MPI_COMM_SPAWN_MULTIPLE test. This commit was SVN r14700.	2007-05-19 02:30:53 +00:00
Ralph Castain	fa5a40070d	Test the return status code from comm_dyn_start_processes - if we see an error, then let's report it and not continue on with the comm_spawn procedure! This commit was SVN r14699.	2007-05-18 20:22:32 +00:00
Ralph Castain	180c96bb8f	Clear an erroneous error message pending a more complete fix This commit was SVN r14698.	2007-05-18 14:44:27 +00:00
Tim Mattox	a65d1d6acc	Add the first NEWS entry for 1.2.3 This commit was SVN r14696.	2007-05-18 14:34:11 +00:00
Ralph Castain	6ef39ef525	Extend fix in r14693 to the comm_spawn case This commit was SVN r14694. The following SVN revision numbers were found above: r14693 --> open-mpi/ompi@75d51812a3	2007-05-18 13:34:26 +00:00
Ralph Castain	75d51812a3	Fix the app-failed-to-start capability that was broken by r14554 (holding the caller in rmgr.spawn until the application - as opposed to just the orteds - have started). Allow the rmgr.spawn function to return if the app terminates, correctly handling its return status code to show abnormal termination. Modify orterun to correctly handle the returned status code so it doesn't enter a conditioned wait if the app fails to start since it will never wake up if it does. This commit was SVN r14693. The following SVN revision numbers were found above: r14554 --> open-mpi/ompi@4510b42638	2007-05-18 13:29:11 +00:00
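
The local_rank numbering described in the r14706 message (d9acc93efa) can be reproduced with a small sketch. This is not Open MPI code: the function name `local_ranks` and the dictionary layout are hypothetical, and it assumes local ranks are handed out in ascending vpid order on each node, which is what the example in that message suggests.

```python
from collections import defaultdict


def local_ranks(vpid_to_host):
    """Map each vpid of ONE job to (local_rank, num_local_procs).

    Hypothetical helper for illustration only; assumes local ranks
    follow ascending vpid order within each host.
    """
    by_host = defaultdict(list)
    for vpid in sorted(vpid_to_host):
        by_host[vpid_to_host[vpid]].append(vpid)

    result = {}
    for vpids in by_host.values():
        for local_rank, vpid in enumerate(vpids):
            result[vpid] = (local_rank, len(vpids))
    return result


# Launch pattern from the commit message:
# host1 holds vpids 0, 2, 4, 6 and host2 holds vpids 1, 3, 5, 7.
parent_job = {v: ("host1" if v % 2 == 0 else "host2") for v in range(8)}
parent = local_ranks(parent_job)

# A comm_spawn of one process on host1 creates a separate job, so it is
# numbered independently of the parent job.
child = local_ranks({0: "host1"})
```

Under these assumptions `parent[4]` is `(2, 4)` and `child[0]` is `(0, 1)`, matching the values quoted in the message. The parent job's values are unaffected by the spawn because each job is numbered on its own.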


9892 commits