openmpi

Author	SHA1	Message	Date
Nathan Hjelm	1d1cef76df	opal: fix leaks Two leaks are fixed by this commit: - opal_dss.lookup_data_type returns an allocated string. Free it. - opal_ifaddrtokindex was leaking a struct addrinfo. Ensure that is released before returning. cmr=v1.8.2:reviewer=rhc This commit was SVN r31777.	2014-05-15 15:59:41 +00:00
Jeff Squyres	090ce4187a	Fix compiler errors on Solaris, NetBSD, and OpenBSD: * Per http://www.open-mpi.org/community/lists/devel/2013/12/13504.php, protect usage of struct ifreq->ifr_hwaddr * Per http://www.open-mpi.org/community/lists/devel/2013/12/13503.php, avoid #define conflict with the token "if_mtu" * Also fix some whitespace and string naming issues in opal/util/if.c Tested by Paul Hargrove. Refs trac:4010 This commit was SVN r30006. The following Trac tickets were found above: Ticket 4010 --> https://svn.open-mpi.org/trac/ompi/ticket/4010	2013-12-20 11:17:30 +00:00
Ralph Castain	8c5c7d0db4	Correct a bug in handling of oob_tcp_if_include/exclude addresses by using the kernel index instead of the raw index of the interface. Refs trac:3696 This commit was SVN r29522. The following Trac tickets were found above: Ticket 3696 --> https://svn.open-mpi.org/trac/ompi/ticket/3696	2013-10-26 00:47:14 +00:00
Ralph Castain	a200e4f865	As per the RFC, bring in the ORTE async progress code and the rewrite of OOB: * THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE * Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro. *************************************************************************************** I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week. The code is in https://bitbucket.org/rhc/ompi-oob2 WHAT: Rewrite of ORTE OOB WHY: Support asynchronous progress and a host of other features WHEN: Wed, August 21 SYNOPSIS: The current OOB has served us well, but a number of limitations have been identified over the years. Specifically: * it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code) * we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface. 
* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients * there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort * only one transport (i.e., component) can be "active" The revised OOB resolves these problems: * async progress is used for all application processes, with the progress thread blocking in the event library * each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on") * multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC. * a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions. * opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object * NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. 
If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions * obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel * the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport * routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active * all blocking send/recv APIs have been removed. Everything operates asynchronously. KNOWN LIMITATIONS: * although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline * the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker * routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways * obviously, not every error path has been tested nor necessarily covered * determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when all transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost. 
* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways * the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC This commit was SVN r29058.	2013-08-22 16:37:40 +00:00
Ralph Castain	bd65937bf3	If we enable ipv6, we resolve a host's addresses and check them all against our local interfaces to determine if the given host is us. However, if we don't enable ipv6, we only checked the first address returned. This can cause us to incorrectly identify a hostname as "not us". Make --disable-ipv6 behave the same as --enable-ipv6 by checking all the returned addresses. This commit was SVN r28716.	2013-07-03 21:41:36 +00:00
Jeff Squyres	089c632cce	Remove a bunch of dead code: gcc 4.7 warns of set-but-unused variables. So get rid of them. This commit was SVN r28538.	2013-05-17 21:45:49 +00:00
Ralph Castain	37088f23d8	When ipv6 disabled, we still have getaddrinfo, so use it when checking common networks for resolving to kindex This commit was SVN r28496.	2013-05-14 15:54:46 +00:00
Ralph Castain	3fc1bafd82	fix typo This commit was SVN r28490.	2013-05-14 12:36:45 +00:00
Ralph Castain	f4f07bdb21	Ensure the opal_ifaddrtokindex function considers the full range of address space by using the netmask This commit was SVN r28487.	2013-05-14 03:37:44 +00:00
Ralph Castain	b73f25e839	Add a function to return the kernel index of the corresponding interface from an IPv4/6 string or hostname This commit was SVN r28397.	2013-04-25 19:40:34 +00:00
Ralph Castain	cef639f578	Ahem....cleanup a copy/paste error in naming of these functions This commit was SVN r28395.	2013-04-25 15:21:53 +00:00
Jeff Squyres	c722440411	Add public functions for retrieving the MAC and MTU (paired with r28344). This commit was SVN r28345. The following SVN revision numbers were found above: r28344 --> open-mpi/ompi@e88881c25f	2013-04-17 22:32:32 +00:00
Nathan Hjelm	365cf48db5	Update OPAL frameworks to use the MCA framework system. This commit was SVN r28239.	2013-03-27 21:11:47 +00:00
Ralph Castain	a4b6fb241f	Remove all remaining vestiges of the Windows integration This commit was SVN r28137.	2013-02-28 17:31:47 +00:00
Brian Barrett	29aaa21c5a	Fix some warnings when we don't have sockets or syslog This commit was SVN r27973.	2013-01-29 23:02:26 +00:00
Ralph Castain	fdf7633cff	Per Jeff's suggestion, set the default answer when asking for IP aliases in case we don't find any This commit was SVN r27620.	2012-11-16 14:28:30 +00:00
Ralph Castain	a52071a17d	Add a function to return the aliases (based on IP addrs) for the current node This commit was SVN r27618.	2012-11-16 04:02:29 +00:00
Ralph Castain	4c06c9c07c	Simplify the code a little bit by recognizing that end=start isn't an error, but just indicates a partial address typical of CIDR notation. This commit was SVN r24757.	2011-06-07 11:33:22 +00:00
Ralph Castain	666fdeab8f	Okay to return an error on end=start of string conversion so long as the strlen > 0, so restore that error check. This commit was SVN r24756.	2011-06-07 03:20:01 +00:00
Ralph Castain	f3cae3d6f3	Cleanup the handling of if_include and if_exclude arguments based on CIDR notation. Fix a bug in the new code that prevented the system from correctly matching addresses. Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide. Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result. Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages. NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing. This commit was SVN r24755.	2011-06-07 02:09:11 +00:00
George Bosilca	7ebd094ecf	Cleanup the IPv4 address parsing, and correct the error message. This commit was SVN r24750.	2011-06-06 03:08:02 +00:00
Ralph Castain	1491d52bd7	Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3"). This commit was SVN r24747.	2011-06-05 19:16:42 +00:00
George Bosilca	34abbce82c	More accurate and trustworthy descriptions of the netmask exist. Interested readers can quench their curiosity either with one of the Richard Stevens books (ISBN 9780201633467) or the Wikipedia page (http://en.wikipedia.org/wiki/Subnetwork). This commit was SVN r24680.	2011-05-03 21:59:51 +00:00
Ralph Castain	257473ebca	Remove an extra "break" - thanks to Rainer for pointing it out. This commit was SVN r24667.	2011-05-02 12:20:37 +00:00
Ralph Castain	7b29a6153e	Cover all the netmask values This commit was SVN r24665.	2011-04-29 17:56:15 +00:00
Ralph Castain	ac1853b5d8	Took me a couple of days, but finally tracked this one down. Some compilers/glibc's don't like composite test statements in a return and just randomly pick one of the two options. So....don't do that!!! This commit was SVN r24212.	2011-01-10 16:29:42 +00:00
Ralph Castain	3631e4e936	Revert remaining svn kruft from r23764 This commit was SVN r23786. The following SVN revision numbers were found above: r23764 --> open-mpi/ompi@40a2bfa238	2010-09-22 01:11:40 +00:00
Ralph Castain	40a2bfa238	WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues. This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change. Please note that configure requirements on components HAVE CHANGED. For example. a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation. This commit was SVN r23764.	2010-09-17 23:04:06 +00:00
Ralph Castain	e96b5f486f	Reorganize the opal interface code in opal/util/if.c per prior emails and telecon discussions. Move the interface discovery code into a framework so that configuration logic can separate it out (instead of the prior #if-#else confusion). All interface APIs for accessing the info remain unchanged in opal/util/if.c. This has been tested on Mac, Linux, and NetBSD. Nobody else seemed interested in testing it, so there may be some future problems revealed as people try it on other OSs. This commit was SVN r23743.	2010-09-13 01:58:51 +00:00
Ralph Castain	51833bfe6c	Not -everyone- wants to ignore loopback devices. Give us a choice. This commit was SVN r23637.	2010-08-24 02:37:05 +00:00
Jeff Squyres	245dc1a86d	Add a cast to avoid a compiler warning on BSD. This commit was SVN r23502.	2010-07-27 14:14:37 +00:00
Jeff Squyres	0ce1a82cde	This commit looks much bigger than it is. There are only 2 substantive changes in this commit; the rest are minor style changes: 1. Change an OBJ_NEW(opal_list_item_t) to OBJ_NEW(opal_if_t). This was causing memory corruption in the BSD code paths. 2. Move some local variables from the top of opal_if_init() to inside the non-BSD code paths so that we avoid bunches of warnings about unused variables when compiling on BSD. In doing so, I indented the whole non-BSD section one level deeper, making the commit look huge. I also added a few {} around 1-line blocks, added some spaces, broke a few lines, re-formatted a few comments, ...etc. Trivial stuff. This commit was SVN r23501.	2010-07-27 13:46:55 +00:00
Jeff Squyres	c26dae01ce	Update the if.c code to properly use the OBJ_* system. This commit was SVN r22869.	2010-03-23 20:37:06 +00:00
Jeff Squyres	17f0885f12	Add proper BSD interface detection code. Fixes a long-standing discussion on the users list (see http://www.open-mpi.org/community/lists/users/2009/12/11526.php). Many thanks to Kevin Buckley who did most of the coding work, and to Aleksej Saushev for his extreme patience in waiting for me to review and commit this stuff. This commit was SVN r22640.	2010-02-17 19:43:57 +00:00
Brian Barrett	86d8356b13	Updates to allow OMPI to build on Cray XT platforms running Catamount This commit was SVN r22381.	2010-01-07 18:14:03 +00:00
Ralph Castain	c58a30ea10	Add two new functions: 1. check for loopback interface 2. convert tuple addresses to ip addrs + mask This commit was SVN r22080.	2009-10-09 15:24:41 +00:00
Shiqing Fan	0e09cb650e	The kernel index of the network interface wasn't set on Windows, it really caused a lot of problems. This commit was SVN r21587.	2009-07-02 14:44:41 +00:00
Rainer Keller	221fb9dbca	... Delayed due to notifier commits earlier this day ... - Delete unnecessary header files using contrib/check_unnecessary_headers.sh after applying patches, that include headers, being "lost" due to inclusion in one of the now deleted headers... In total 817 files are touched. In ompi/mpi/c/ header files are moved up into the actual c-file, where necessary (these are the only additional #include), otherwise it is only deletions of #include (apart from the above additions required due to notifier...) - To get different MCAs (OpenIB, TM, ALPS), an earlier version was successfully compiled (yesterday) on: Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled This commit was SVN r21096.	2009-04-29 01:32:14 +00:00
Jeff Squyres	0bd9ef0bb9	Some valgrind-clean fixes. Thanks to "Number Cruncher" on the devel list for pointing these out. This commit was SVN r21060.	2009-04-23 18:50:46 +00:00
Nysal Jan	ab18a3629f	Change the return type to handle the case where an invalid interface name is passed to this function. This commit was SVN r20933.	2009-04-02 18:35:09 +00:00
Jeff Squyres	2815cb88b4	Fixes trac:1836: no reason to constrain the latter numbers to 2 hex digits. They likely shouldn't be more than 2 digits anyway, but let's be social just in case they are (e.g., https://bugs.openfabrics.org/show_bug.cgi?id=1544). This commit was SVN r20824. The following Trac tickets were found above: Ticket 1836 --> https://svn.open-mpi.org/trac/ompi/ticket/1836	2009-03-18 14:43:00 +00:00
Rolf vandeVaart	cad49da72d	Fix the tcp btl so it makes use of the btl_tcp_if_include and btl_tcp_if_exclude parameters on the connecting side also. Also move define of IF_NAMESIZE into if.h file. And lastly, add one verbose debug message which may be useful if we run into other issues like this. This commit fixes trac:1573. This commit was SVN r19932. The following Trac tickets were found above: Ticket 1573 --> https://svn.open-mpi.org/trac/ompi/ticket/1573	2008-11-05 18:45:42 +00:00
Shiqing Fan	a456c057d6	- Skip the loopback address on windows. This commit was SVN r19862.	2008-10-31 17:02:41 +00:00
Ralph Castain	e7487ad533	Implement the seq rmaps module that sequentially maps process ranks to a list of hosts in a hostfile. Restore the "do-not-launch" functionality so users can test a mapping without launching it. Add a "do-not-resolve" cmd line flag to mpirun so the opal/util/if.c code does not attempt to resolve network addresses, thus enabling a user to test a hostfile mapping without hanging on network resolve requests. Add a function to hostfile to generate an ordered list of host names from a hostfile. This commit was SVN r18190.	2008-04-17 13:50:59 +00:00
Adrian Knoth	601fb4389d	Cosmetics for r17150. Closes trac:1201 This commit was SVN r17151. The following SVN revision numbers were found above: r17150 --> open-mpi/ompi@4b50f02126 The following Trac tickets were found above: Ticket 1201 --> https://svn.open-mpi.org/trac/ompi/ticket/1201	2008-01-17 12:29:12 +00:00
Adrian Knoth	4b50f02126	Only free res iff it's been allocated before. Re #1201 This patch fixes the segfault, so closing the ticket might be possible. It's a very conservative patch. Perhaps the freeaddrinfo spec says that it will never allocate res in case of errors, but for now, I neither have the spec nor the will to rely on it. This commit was SVN r17150.	2008-01-17 10:01:52 +00:00
George Bosilca	921d79c2b8	Remove few memory leaks. Close the files where we're done with them. This commit was SVN r16125.	2007-09-14 02:06:26 +00:00
Shiqing Fan	a389e61330	- Add some type casts, required by MS compiler. This commit was SVN r16085.	2007-09-11 09:32:11 +00:00
Brian Barrett	951755f9fb	no need to call gethostname twice to determine if a process is local This commit was SVN r15742.	2007-08-02 16:25:25 +00:00
Brian Barrett	c3be7376c5	* Mark some of the structures passed into the if and net code as const when they actually are const. * Remove some dead code from the no IP support case * Add doxygen comment for opal_net_get_port() This commit was SVN r15547.	2007-07-22 19:19:01 +00:00
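
Several of the commits listed above (f4f07bdb21, 1491d52bd7, f3cae3d6f3) describe matching an address against an interface by applying the netmask, and accepting subnet+mask (CIDR) specifications in if_include/if_exclude. The following is a minimal Python sketch of that general idea only, not the code in opal/util/if.c; the function names `same_subnet` and `matches_spec` are hypothetical and exist purely for illustration.

```python
import ipaddress


def same_subnet(addr: str, iface_addr: str, prefix_len: int) -> bool:
    """Return True if two IPv4 addresses share the same network prefix.

    Hypothetical helper: both addresses are masked with the netmask
    derived from prefix_len and the resulting network parts compared.
    """
    a = int(ipaddress.IPv4Address(addr))
    b = int(ipaddress.IPv4Address(iface_addr))
    # Build a 32-bit netmask with prefix_len leading one bits.
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    return (a & mask) == (b & mask)


def matches_spec(addr: str, spec: str) -> bool:
    """Return True if addr falls inside a CIDR spec such as '10.1.0.0/16'.

    Hypothetical helper built on the standard ipaddress module;
    strict=False tolerates host bits being set in the spec.
    """
    network = ipaddress.ip_network(spec, strict=False)
    return ipaddress.ip_address(addr) in network


if __name__ == "__main__":
    print(same_subnet("192.168.1.77", "192.168.1.10", 24))
    print(matches_spec("10.1.2.3", "10.1.0.0/16"))
```

Comparing masked values rather than raw address strings is what lets a single include/exclude entry cover every address in a subnet, which is the behavior the commit messages describe.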

90 commits