openmpi

Author	SHA1	Message	Date
Iain Bason	18d9e96301	Fixed two problems: 1. The code that looks at btl_tcp_if_exclude before doing a modex_send uses strcmp rather than strncmp. That means that "lo0" gets sent even though "lo" is excluded. 2. The code that determines whether a particular local TCP interface can connect to a particular remote interface doesn't check for loopback interfaces. With this fix, users can now enable "lo" and be assured that it will only be used for intra-node communication. This commit was SVN r22762.	2010-03-03 15:51:15 +00:00
Rolf vandeVaart	2715141f6d	Fix minor bug in the way we handle btl_tcp_if_include list. This commit was SVN r22722.	2010-02-26 18:08:04 +00:00
George Bosilca	7eff2cdf85	Unrestricted number of interfaces. This commit was SVN r22669.	2010-02-19 07:10:32 +00:00
Ralph Castain	ded58ae483	Silence some compiler warnings about print statements. This commit was SVN r21814.	2009-08-13 13:45:38 +00:00
Jeff Squyres	f960f2d944	Fix compiler warning. This commit was SVN r21312.	2009-05-28 13:34:48 +00:00
Rainer Keller	b2f8095ba7	- Update to fix in r21234: as discussed on devel@, for printing size_t use "%lu" and cast to (unsigned long). This commit was SVN r21238. The following SVN revision numbers were found above: r21234 --> open-mpi/ompi@22b6177fb9	2009-05-14 14:10:22 +00:00
Rainer Keller	22b6177fb9	- Use the "z" length modifier for size_t arguments for printf. This commit was SVN r21234.	2009-05-14 00:52:20 +00:00
Rainer Keller	29b1b205fd	- Remove two headers (and actually include rml.h) prior to test of removal script... This commit was SVN r20765.	2009-03-12 17:58:39 +00:00
George Bosilca	760e744294	Use a more clear name for the proc in the constructor and destructor functions. Make sure the lock is created and destroyed as expected. This commit was SVN r20197.	2009-01-05 14:14:38 +00:00
Rolf vandeVaart	cad49da72d	Fix the tcp btl so it makes use of the btl_tcp_if_include and btl_tcp_if_exclude parameters on the connecting side also. Also move define of IF_NAMESIZE into if.h file. And lastly, add one verbose debug message which may be useful if we run into other issues like this. This commit fixes trac:1573. This commit was SVN r19932. The following Trac tickets were found above: Ticket 1573 --> https://svn.open-mpi.org/trac/ompi/ticket/1573	2008-11-05 18:45:42 +00:00
Shiqing Fan	04ee20a880	- Mainly type casts. Microsoft VC++ compiler is too strict. This commit was SVN r19517.	2008-09-08 15:39:30 +00:00
Ralph Castain	9613b3176c	Effectively revert the orte_output system and return to direct use of opal_output at all levels. Retain the orte_show_help subsystem to allow aggregation of show_help messages at the HNP. After much work by Jeff and myself, and quite a lot of discussion, it has become clear that we simply cannot resolve the infinite loops caused by RML-involved subsystems calling orte_output. The original rationale for the change to orte_output has also been reduced by shifting the output of XML-formatted vs human readable messages to an alternative approach. I have globally replaced the orte_output/ORTE_OUTPUT calls in the code base, as well as the corresponding .h file name. I have test compiled and run this on the various environments within my reach, so hopefully this will prove minimally disruptive. This commit was SVN r18619.	2008-06-09 14:53:58 +00:00
Jeff Squyres	e7ecd56bd2	This commit represents a bunch of work on a Mercurial side branch. As such, the commit message back to the master SVN repository is fairly long. = ORTE Job-Level Output Messages = Add two new interfaces that should be used for all new code throughout the ORTE and OMPI layers (we already made the search-and-replace on the existing ORTE / OMPI layers): * orte_output(): (and corresponding friends ORTE_OUTPUT, orte_output_verbose, etc.) This function sends the output directly to the HNP for processing as part of a job-specific output channel. It supports all the same outputs as opal_output() (syslog, file, stdout, stderr), but for stdout/stderr, the output is sent to the HNP for processing and output. More on this below. * orte_show_help(): This function is a drop-in replacement for opal_show_help(), with two differences in functionality: 1. the rendered text help message output is sent to the HNP for display (rather than outputting directly into the process' stderr stream) 2. the HNP detects duplicate help messages and does not display them (so that you don't see the same error message N times, once from each of your N MPI processes); instead, it counts "new" instances of the help message and displays a message every ~5 seconds when there are new ones ("I got X new copies of the help message...") opal_show_help and opal_output still exist, but they only output in the current process. The intent for the new orte_* functions is that they can apply job-level intelligence to the output. As such, we recommend that all new ORTE and OMPI code use the new orte_* functions, not their opal_* functions. === New code === For ORTE and OMPI programmers, here's what you need to do differently in new code: * Do not include opal/util/show_help.h or opal/util/output.h. Instead, include orte/util/output.h (this one header file has declarations for both the orte_output() series of functions and orte_show_help()). 
* Effectively s/opal_output/orte_output/gi throughout your code. Note that orte_output_open() takes a slightly different argument list (as a way to pass data to the filtering stream -- see below), so if you explicitly call opal_output_open(), you'll need to slightly adapt to the new signature of orte_output_open(). * Literally s/opal_show_help/orte_show_help/. The function signature is identical. === Notes === * orte_output'ing to stream 0 will do similar to what opal_output'ing did, so leaving a hard-coded "0" as the first argument is safe. * For systems that do not use ORTE's RML or the HNP, the effect of orte_output_* and orte_show_help will be identical to their opal counterparts (the additional information passed to orte_output_open() will be lost!). Indeed, the orte_* functions simply become trivial wrappers to their opal_* counterparts. Note that we have not tested this; the code is simple but it is quite possible that we mucked something up. = Filter Framework = Messages sent via the new orte_* functions described above and messages output via the IOF on the HNP will now optionally be passed through a new "filter" framework before being output to stdout/stderr. The "filter" OPAL MCA framework is intended to allow preprocessing of messages before they are sent to their final destinations. The first component that was written in the filter framework was to create an XML stream, segregating all the messages into different XML tags, etc. This will allow 3rd party tools to read the stdout/stderr from the HNP and be able to know exactly what each text message is (e.g., a help message, another OMPI infrastructure message, stdout from the user process, stderr from the user process, etc.). Filtering is not active by default. Filter components must be specifically requested, such as: {{{ $ mpirun --mca filter xml ... }}} There can only be one filter component active. 
= New MCA Parameters = The new functionality described above introduces two new MCA parameters: * '''orte_base_help_aggregate''': Defaults to 1 (true), meaning that help messages will be aggregated, as described above. If set to 0, all help messages will be displayed, even if they are duplicates (i.e., the original behavior). * '''orte_base_show_output_recursions''': An MCA parameter to help debug one of the known issues, described below. It is likely that this MCA parameter will disappear before v1.3 final. = Known Issues = * The XML filter component is not complete. The current output from this component is preliminary and not real XML. A bit more work needs to be done to configure.m4 search for an appropriate XML library/link it in/use it at run time. * There are possible recursion loops in the orte_output() and orte_show_help() functions -- e.g., if RML send calls orte_output() or orte_show_help(). We have some ideas how to fix these, but figured that it was ok to commit before feature freeze with known issues. The code currently contains sub-optimal workarounds so that this will not be a problem, but it would be good to actually solve the problem rather than have hackish workarounds before v1.3 final. This commit was SVN r18434.	2008-05-13 20:00:55 +00:00
Adrian Knoth	c53d3c3c22	reverted r18169,r18170 due to connection reset by peer on odin/sif This commit was SVN r18255. The following SVN revision numbers were found above: r18169 --> open-mpi/ompi@20473bfda2 r18170 --> open-mpi/ompi@d34dfbe12c	2008-04-23 15:26:15 +00:00
Ralph Castain	fa082cafa9	Shift the architecture calculation from the ompi/datatype engine to the opal/util area. This allows us to compute the architecture earlier in the launch and communicate it outside of the modex. Note: this is an early preliminary step in the movement of portions of the datatype engine to the opal layer. This commit was SVN r18198.	2008-04-17 20:43:56 +00:00
Adrian Knoth	d34dfbe12c	fixed misleading comment. This commit was SVN r18170.	2008-04-16 11:26:15 +00:00
Adrian Knoth	20473bfda2	on incoming connections, compare with every possible source address. Rationale (taken from the code): /* This is PITA. We never know which source address an * incoming/outgoing packet will have, so even with * btl_tcp_if_include/exclude on the remote end, we * might get a different source address. * * If this address isn't included in btl_proc->proc_addrs, * we would erroneously drop the connection */ merge -r18165:18167 to the trunk. This commit was SVN r18169. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r18165 r18167	2008-04-16 11:24:09 +00:00
Adrian Knoth	e981a259bb	btl_tcp_disable_family=4 and btl_tcp_disable_family=6 are mutually exclusive, so this should result in "unreachable" when set differently between peers. This commit was SVN r18168.	2008-04-16 10:14:58 +00:00
Tim Prins	5de3e1965e	Remove the orte_proc_table. Migrate all users of it to the opal_hash_table and a new name hash function in orte. Everything should work, however I am unable to compile and test the sctp BTL. This commit was SVN r17751.	2008-03-05 22:44:35 +00:00
Adrian Knoth	f1648f08df	Advanced address selection code from Thomas Peiselt. Re #1207 , #1027 This commit was SVN r17450.	2008-02-13 21:53:00 +00:00
Adrian Knoth	8ae4a10b4c	Reverted r17331, r17332. Still broken. I'm in a bad hurry. :-( Re #1206 This commit was SVN r17333. The following SVN revision numbers were found above: r17331 --> open-mpi/ompi@3846e2a797 r17332 --> open-mpi/ompi@c03de08c55	2008-01-30 16:51:55 +00:00
Adrian Knoth	c03de08c55	Logic is wrong. I'm going to revert it again. Re #1206 This commit was SVN r17332.	2008-01-30 16:48:50 +00:00
Adrian Knoth	3846e2a797	When checking incoming connections, also care about aliased interfaces. Re #1206 This commit was SVN r17331.	2008-01-30 16:45:41 +00:00
Adrian Knoth	7f79c68930	Reverted r17307 and r17308. It broke parallel TCP connections. Re #1206 This commit was SVN r17329. The following SVN revision numbers were found above: r17307 --> open-mpi/ompi@7a59b3f58c r17308 --> open-mpi/ompi@72b29bc21f	2008-01-30 14:31:47 +00:00
Adrian Knoth	72b29bc21f	Cosmetic patch. Use IN6_ARE_ADDR_EQUAL instead of memcmp(). Re #1206 . This commit was SVN r17308.	2008-01-29 16:02:24 +00:00
Adrian Knoth	7a59b3f58c	accept incoming connections from hosts with multiple addresses. We loop over all peer addresses and accept when one of them matches. Note that this might break functionality: mca_btl_tcp_proc_insert now always inserts the same endpoint. (is the lack of endpoints the problem? should there be one for every remote address?) Re #1206 This commit was SVN r17307.	2008-01-29 15:55:56 +00:00
George Bosilca	6310ce955c	The first patch related to the Active Message stuff. So far, here is what we have: - the registration array is now global instead of one per BTL. - each framework has to declare the entries in the registration array reserved. Then it has to define the internal way of sharing (or not) these entries between all components. As an example, the PML will not share as there is only one active PML at any moment, while the BTLs will have to. The tag is 8 bits long, the first 3 are reserved for the framework while the remaining 5 are used internally by each framework. - The registration function is optional. If a BTL does not provide such a function, nothing happens. However, in the case where such a function is provided in the BTL structure, it will be called by the BML when a tag is registered. Now, it's time for the second step... Converting OB1 from a switch based PML to an active message one. This commit was SVN r17140.	2008-01-15 05:32:53 +00:00
Brian Barrett	8b9e8054fd	Move modex from pml base to general ompi runtime, since it's used by more than just the PML/BTLs these days. Also clean up the code so that it handles the situation where not all nodes register information for a given node (rather than just spinning until that node sends information, like we do today). Includes r15234 and r15265 from the /tmp/bwb-modex branch. This commit was SVN r15310. The following SVN revisions from the original message are invalid or inconsistent and therefore were not cross-referenced: r15234 r15265	2007-07-09 17:16:34 +00:00
Brian Barrett	33a5758521	Some IPv6 improvements: * Move ipv6comat.h code into opal_config_bottom.h and change into some more intelligent testing of structures * Change opal's if interface to use sockaddr instead of sockaddr_storage, as the RFCs suggest we do * Move the networking code in opal that isn't directly related to if detection into net.h * Add quicky function to get the port out of either a sockaddr_in or sockaddr_in6, saving a bunch of code in the oob. * Update TCP oob and btl with new interface This commit was SVN r14679.	2007-05-17 01:17:59 +00:00
Adrian Knoth	d63d125a88	I guess we only need this when IPv6 is enabled. This commit was SVN r14551.	2007-04-29 16:38:34 +00:00
Adrian Knoth	5765ecc22e	This patch reverts r14549 while retaining IPv6 support. Re #1008 This commit was SVN r14550. The following SVN revision numbers were found above: r14549 --> open-mpi/ompi@386baed55b	2007-04-29 16:23:11 +00:00
Adrian Knoth	386baed55b	Hotfix for IPv6 support. Closes trac:1008 This commit was SVN r14549. The following Trac tickets were found above: Ticket 1008 --> https://svn.open-mpi.org/trac/ompi/ticket/1008	2007-04-29 13:46:45 +00:00
George Bosilca	46265db0a9	Update the TCP BTL in order to bring back some of the functionalities lost during the IPv6 patch. The most important is the multi BTL support. There was a quite interesting bug. Instead of setting up the multiple connections over different physical devices, based on the time when these connections were created most of the time they were all using the same physical network. Which, of course, was not the intended goal, as we top out at the maximum bandwidth available over one device instead of gathering all available bandwidth from all devices. Second, the IPv6 RFC suggests using sockaddr_storage as a holder for the IP information, but using a sockaddr* when we pass it to functions. This is only partially corrected by this patch. Some other minor cleanups. This commit was SVN r14544.	2007-04-28 19:13:47 +00:00
Brian Barrett	4b8bb70afb	A couple cleanups for the IPv6 support: - make opal_sockaddr2str() take a sockaddr_storage instead of a sockaddr_in6 so that it works for IPv4 and IPv6 addresses, and remove a whole bunch of #ifs in the OOB code. - Fix a compiler warning in the TCP BTL due to run-time determined array size by making it a dynamically allocated array. - Fix the unpacking code of IPv4 addresses when using IPv6 support, so that the address is in the correct location (instead of in an IPv6 structure, use an IPv4 structure). Refs trac:1005. This commit was SVN r14514. The following Trac tickets were found above: Ticket 1005 --> https://svn.open-mpi.org/trac/ompi/ticket/1005	2007-04-25 19:08:07 +00:00
Adrian Knoth	d1ce39de4f	Move mca_btl_tcp_addr_isipv4public to opal_addr_isipv4public. This commit was SVN r14512.	2007-04-25 18:06:06 +00:00
Jeff Squyres	c4c68e666a	Merge in the ipv6 work from /tmp/ipv6-merge. This commit was SVN r14503.	2007-04-25 01:55:40 +00:00
Jeff Squyres	7b59847765	Ensure that endpoint->endpoint_addr is not NULL before trying to dereference through it. It is legal for endpoint_addr to be NULL in the destructor because if btl_tcp_add_procs() -> btl_tcp_proc_insert() returns UNREACH, then endpoint_addr will be NULL and we'll OBJ_RELEASE it. This commit was SVN r9940.	2006-05-16 19:01:08 +00:00
Brian Barrett	9b19e3fef0	* remove some debugging output that shouldn't have been committed. Doh! This commit was SVN r9171.	2006-02-27 16:23:52 +00:00
Brian Barrett	285581dff2	More endian-related cleanups: - moved hton64 and ntoh64 from the bunch of places it had been copied into one header file - properly set and use the btl_tcp's nbo option to put things in network byte order on the wire if both sides don't have the same endianness - Put the OB1 PML's headers (with a couple exceptions I need to discuss with Tim) in network byte order on the wire if both sides don't have the same endianness - since it was needed for the TCP BTL, move the orte_process_name_t HTON and NTOH macros from the TCP OOB to ns_types.h This commit was SVN r9145.	2006-02-26 00:45:54 +00:00
Jeff Squyres	628125599d	Fix the TCP btl module endpoint matching during setup for the scenario when running an MPI job spanning a node that has two TCP NICs and a node that has one TCP NIC. Previously, for the 2 NIC/module process, we would return the first peer IP address if we couldn't find a subnet match with any of the peer's published IP addresses -- this was to support running OMPI across subnet boundaries. Changed the behavior to only do that if the IP address we're trying to match is public (i.e., not 10.x.y.z, 192.168.x.y, or 172.16.x.y) and any of the remote peer's addresses are public (working on the assumption that if we both have public addresses, they're routable to each other). This definitely will not work in all scenarios, such as when we go to WAN kinds of executions, and will need to be revisited at that time. This commit was SVN r9119.	2006-02-23 02:02:19 +00:00
George Bosilca	1b667067d6	I need to know the number of iovec attached to the fragment. This commit was SVN r8447.	2005-12-10 23:28:16 +00:00
Galen Shipman	5cf2d8d40c	default to first available IP address if no matching subnets found.. This commit was SVN r8125.	2005-11-12 00:31:34 +00:00
Jeff Squyres	42ec26e640	Update the copyright notices for IU and UTK. This commit was SVN r7999.	2005-11-05 19:57:48 +00:00
George Bosilca	8b93cb7661	Rename all the functions starting with mca_base_modex to mca_pml_base_modex. Change all the places where they are used to fit the new name. Remove the code to check the remote arch from the PML. We will have a GPR mechanism in ompi_mpi_initialize to do that. This commit was SVN r6750.	2005-08-05 18:03:30 +00:00
Tim Woodall	2214f0502d	- first cut at tcp btl (working but not optimal) - reworked btl error logging macros This commit was SVN r6701.	2005-08-02 13:20:50 +00:00

45 Commits