1
1

99 Коммитов

Автор SHA1 Сообщение Дата
Geoff Paulsen
56a55cdb4b
Merge pull request #5786 from jsquyres/pr/string-madness
Replace strcpy() and strncpy() with (new) opal_string_copy()
2018-10-04 16:12:46 -05:00
Brian Barrett
a25df3f29e opal: Remove outdated MacOS workaround
Remove the pack/unpack pragma around net/if.h on MacOS, which
was added to fix a bug in MacOS X 10.4.x on 64-bit platforms.
The bug was fixed in Mac OS X 10.5.0 and, sometime in the last
11 years, compilers started emitting warnings about the fact
that the Apple header stomped over the pragma pack settings
from the workaround.  We already don't support versions of MacOS
earlier than 10.5, so there's no point in keeping the workaround.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
2018-10-02 13:35:46 -04:00
Jeff Squyres
28e51ad4d4 opal: convert from strncpy() -> opal_string_copy()
In many cases, this was a simple string replace.  In a few places, it
entailed:

1. Updating some comments and removing now-redundant foo[size-1]='\0'
   statements.
2. Updating passing (size-1) to (size) (because opal_string_copy()
   wants the entire destination buffer length).

This commit actually fixes a bunch of potential (yet quite unlikely)
bugs where we could have ended up with non-null-terminated strings.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
2018-09-27 11:56:18 -07:00
Wojtek Wasko
276de13a1e Make interface's kernel index an int instead of int16_t
Sometimes, the ethernet interfaces can get quite high kernel indices. struct
ifreq (see netdevice(7)) defines ifr_ifindex to be int's. The OOB component
used int16_t internally for matching (in case of -mca oob_tcp_if_[in|ex]clude)
which meant that any interface index > 32767 would never be matched because the
integer would be truncated to int16_t upon return from the function. OOB would
then refuse to work because it didn't find any usable interfaces and MPI job
would abort.

Signed-off-by: Wojtek Wasko <wwasko@nvidia.com>
2017-11-15 04:32:26 -05:00
Nathan Hjelm
ffd8ee2dfd opal: use opal_list_t convienience macros
This commit cleans up code in opal to use OPAL_LIST_FOREACH(_SAFE),
OPAL_LIST_DESTRUCT, and OPAL_LIST_RELEASE.

Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>
2017-06-20 12:37:12 -06:00
Gilles Gouaillardet
c9aeccb84e opal/if: open the if framework once in opal_init_util
the if framework is no more open in opal_if*, which plugs
several memory leaks

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
2016-12-01 14:24:30 +09:00
Gilles Gouaillardet
dc883cff8d opal/util: fix parse_ipv4_dots prototype 2015-10-01 14:03:08 +09:00
Ralph Castain
869041f770 Purge whitespace from the repo 2015-06-23 20:59:57 -07:00
Jeff Squyres
a1037cd70a if.c: fix minor memory leak
This was CID 1269846.
2015-02-12 13:41:29 -08:00
Nathan Hjelm
1d1cef76df opal: fix leaks
Two leaks are fixed by this commit:

 - opal_dss.lookup_data_type returns an allocated string. Free it.

 - opal_ifaddrtokindex was leaking a struct addrinfo. Ensure that is
   released before returning.

cmr=v1.8.2:reviewer=rhc

This commit was SVN r31777.
2014-05-15 15:59:41 +00:00
Jeff Squyres
090ce4187a Fix compiler errors on Solaris, NetBSD, and OpenBSD:
* Per
   http://www.open-mpi.org/community/lists/devel/2013/12/13504.php, 
   protect usage of struct ifreq->ifr_hwaddr
 * Per
   http://www.open-mpi.org/community/lists/devel/2013/12/13503.php,
   avoid #define conflict with the token "if_mtu"
 * Also fix some whitespace and string naming issues in opal/util/if.c

Tested by Paul Hargrove.

Refs trac:4010

This commit was SVN r30006.

The following Trac tickets were found above:
  Ticket 4010 --> https://svn.open-mpi.org/trac/ompi/ticket/4010
2013-12-20 11:17:30 +00:00
Ralph Castain
8c5c7d0db4 Correct a bug in handling of oob_tcp_if_include/exclude addresses by using the kernel index instead of the raw index of the interface.
Refs trac:3696

This commit was SVN r29522.

The following Trac tickets were found above:
  Ticket 3696 --> https://svn.open-mpi.org/trac/ompi/ticket/3696
2013-10-26 00:47:14 +00:00
Ralph Castain
a200e4f865 As per the RFC, bring in the ORTE async progress code and the rewrite of OOB:
*** THIS RFC INCLUDES A MINOR CHANGE TO THE MPI-RTE INTERFACE ***

Note: during the course of this work, it was necessary to completely separate the MPI and RTE progress engines. There were multiple places in the MPI layer where ORTE_WAIT_FOR_COMPLETION was being used. A new OMPI_WAIT_FOR_COMPLETION macro was created (defined in ompi/mca/rte/rte.h) that simply cycles across opal_progress until the provided flag becomes false. Places where the MPI layer blocked waiting for RTE to complete an event have been modified to use this macro.

***************************************************************************************

I am reissuing this RFC because of the time that has passed since its original release. Since its initial release and review, I have debugged it further to ensure it fully supports tests like loop_spawn. It therefore seems ready for merge back to the trunk. Given its prior review, I have set the timeout for one week.

The code is in  https://bitbucket.org/rhc/ompi-oob2


WHAT:    Rewrite of ORTE OOB

WHY:       Support asynchronous progress and a host of other features

WHEN:    Wed, August 21

SYNOPSIS:
The current OOB has served us well, but a number of limitations have been identified over the years. Specifically:

* it is only progressed when called via opal_progress, which can lead to hangs or recursive calls into libevent (which is not supported by that code)

* we've had issues when multiple NICs are available as the code doesn't "shift" messages between transports - thus, all nodes had to be available via the same TCP interface.

* the OOB "unloads" incoming opal_buffer_t objects during the transmission, thus preventing use of OBJ_RETAIN in the code when repeatedly sending the same message to multiple recipients

* there is no failover mechanism across NICs - if the selected NIC (or its attached switch) fails, we are forced to abort

* only one transport (i.e., component) can be "active"


The revised OOB resolves these problems:

* async progress is used for all application processes, with the progress thread blocking in the event library

* each available TCP NIC is supported by its own TCP module. The ability to asynchronously progress each module independently is provided, but not enabled by default (a runtime MCA parameter turns it "on")

* multi-address TCP NICs (e.g., a NIC with both an IPv4 and IPv6 address, or with virtual interfaces) are supported - reachability is determined by comparing the contact info for a peer against all addresses within the range covered by the address/mask pairs for the NIC.

* a message that arrives on one TCP NIC is automatically shifted to whatever NIC that is connected to the next "hop" if that peer cannot be reached by the incoming NIC. If no TCP module will reach the peer, then the OOB attempts to send the message via all other available components - if none can reach the peer, then an "error" is reported back to the RML, which then calls the errmgr for instructions.

* opal_buffer_t now conforms to standard object rules re OBJ_RETAIN as we no longer "unload" the incoming object

* NIC failure is reported to the TCP component, which then tries to resend the message across any other available TCP NIC. If that doesn't work, then the message is given back to the OOB base to try using other components. If all that fails, then the error is reported to the RML, which reports to the errmgr for instructions

* obviously from the above, multiple OOB components (e.g., TCP and UD) can be active in parallel

* the matching code has been moved to the RML (and out of the OOB/TCP component) so it is independent of transport

* routing is done by the individual OOB modules (as opposed to the RML). Thus, both routed and non-routed transports can simultaneously be active

* all blocking send/recv APIs have been removed. Everything operates asynchronously.


KNOWN LIMITATIONS:

* although provision is made for component failover as described above, the code for doing so has not been fully implemented yet. At the moment, if all connections for a given peer fail, the errmgr is notified of a "lost connection", which by default results in termination of the job if it was a lifeline

* the IPv6 code is present and compiles, but is not complete. Since the current IPv6 support in the OOB doesn't work anyway, I don't consider this a blocker

* routing is performed at the individual module level, yet the active routed component is selected on a global basis. We probably should update that to reflect that different transports may need/choose to route in different ways

* obviously, not every error path has been tested nor necessarily covered

* determining abnormal termination is more challenging than in the old code as we now potentially have multiple ways of connecting to a process. Ideally, we would declare "connection failed" when *all* transports can no longer reach the process, but that requires some additional (possibly complex) code. For now, the code replicates the old behavior only somewhat modified - i.e., if a module sees its connection fail, it checks to see if it is a lifeline. If so, it notifies the errmgr that the lifeline is lost - otherwise, it notifies the errmgr that a non-lifeline connection was lost.

* reachability is determined solely on the basis of a shared subnet address/mask - more sophisticated algorithms (e.g., the one used in the tcp btl) are required to handle routing via gateways

* the RML needs to assign sequence numbers to each message on a per-peer basis. The receiving RML will then deliver messages in order, thus preventing out-of-order messaging in the case where messages travel across different transports or a message needs to be redirected/resent due to failure of a NIC

This commit was SVN r29058.
2013-08-22 16:37:40 +00:00
Ralph Castain
bd65937bf3 If we enable ipv6, we resolve a hosts addresses and check them all against our local interfaces to determine if the given host is us. However, if we don't enable ipv6, we only checked the first address returned. This can cause us to incorrectly identify a hostname as "not us".
Make -disable-ipv6 behave the same as --enable-ipv6 by checking all the returned addresses.

This commit was SVN r28716.
2013-07-03 21:41:36 +00:00
Jeff Squyres
089c632cce Remove a bunch of dead code: gcc 4.7 warns of set-but-unused
variables.  So get rid of them.

This commit was SVN r28538.
2013-05-17 21:45:49 +00:00
Ralph Castain
37088f23d8 When ipv6 disabled, we still have getaddrinfo, so use it when checking common networks for resolving to kindex
This commit was SVN r28496.
2013-05-14 15:54:46 +00:00
Ralph Castain
3fc1bafd82 fix typo
This commit was SVN r28490.
2013-05-14 12:36:45 +00:00
Ralph Castain
f4f07bdb21 Ensure the opal_ifaddrtokindex function considers the full range of address space by using the netmask
This commit was SVN r28487.
2013-05-14 03:37:44 +00:00
Ralph Castain
b73f25e839 Add a function to return the kernel index of the corresponding interface from an IPv4/6 string or hostname
This commit was SVN r28397.
2013-04-25 19:40:34 +00:00
Ralph Castain
cef639f578 Ahem....cleanup a copy/paste error in naming of these functions
This commit was SVN r28395.
2013-04-25 15:21:53 +00:00
Jeff Squyres
c722440411 Add public functions for retrieving the MAC and MTU (paired with
r28344).

This commit was SVN r28345.

The following SVN revision numbers were found above:
  r28344 --> open-mpi/ompi@e88881c25f
2013-04-17 22:32:32 +00:00
Nathan Hjelm
365cf48db5 Update OPAL frameworks to use the MCA framework system.
This commit was SVN r28239.
2013-03-27 21:11:47 +00:00
Ralph Castain
a4b6fb241f Remove all remaining vestiges of the Windows integration
This commit was SVN r28137.
2013-02-28 17:31:47 +00:00
Brian Barrett
29aaa21c5a Fix some warnings when we don't have sockets or syslog
This commit was SVN r27973.
2013-01-29 23:02:26 +00:00
Ralph Castain
fdf7633cff Per Jeff's suggestion, set the default answer when asking for IP aliases in case we don't find any
This commit was SVN r27620.
2012-11-16 14:28:30 +00:00
Ralph Castain
a52071a17d Add a function to return the aliases (based on IP addrs) for the current node
This commit was SVN r27618.
2012-11-16 04:02:29 +00:00
Ralph Castain
4c06c9c07c Simplify the code a little bit by recognizing that end=start isn't an error, but just indicates a partial address typical of CIDR notation.
This commit was SVN r24757.
2011-06-07 11:33:22 +00:00
Ralph Castain
666fdeab8f Okay to return an error on end=start of string conversion so long as the strlen > 0, so restore that error check.
This commit was SVN r24756.
2011-06-07 03:20:01 +00:00
Ralph Castain
f3cae3d6f3 Cleanup the handling of if_include and if_exclude arguments based on CIDR notation.
Fix a bug in the new code that prevented the system from correctly matching addresses.

Remove comments in the show-help text indicating that we would continue in the face of incorrect specifications - leave that to the calling layer to decide.

Modify the new opal_ifmatches so it returns error codes letting the caller better understand the result.

Modify the oob to ensure we abort if we don't find interfaces matching specified constraints, and that we do so without multiple error messages.

NOTE: we have a conflict in our standards. We have been using comma-delimited lists of interfaces for all our params. However, one param - opal_net_private_ipv4 - now uses semicolons instead of comma separators. No idea why, but it is confusing.

This commit was SVN r24755.
2011-06-07 02:09:11 +00:00
George Bosilca
7ebd094ecf Cleanup the IPv4 address parsing, and correct the error message.
This commit was SVN r24750.
2011-06-06 03:08:02 +00:00
Ralph Castain
1491d52bd7 Extend the parsing capability of the oob tcp module's if_include and if_exclude options to support subnet+mask notation, and to handle virtual IP addresses (it was previously having problems distinguishing between "eth1" and "eth1.3").
This commit was SVN r24747.
2011-06-05 19:16:42 +00:00
George Bosilca
34abbce82c More accurate and trustworthy descriptions of the netmask exist.
Interested readers can quench their curiosity either with one
of the Richard Stevens books (ISBN 9780201633467) or the
Wikipedia page (http://en.wikipedia.org/wiki/Subnetwork).

This commit was SVN r24680.
2011-05-03 21:59:51 +00:00
Ralph Castain
257473ebca Remove an extra "break" - thanks to Rainer for pointing it out.
This commit was SVN r24667.
2011-05-02 12:20:37 +00:00
Ralph Castain
7b29a6153e Cover all the netmask values
This commit was SVN r24665.
2011-04-29 17:56:15 +00:00
Ralph Castain
ac1853b5d8 Took me a couple of days, but finally tracked this one down. Some compilers/glibc's don't like composite test statements in a return and just randomly pick one of the two options.
So....don't do that!!!

This commit was SVN r24212.
2011-01-10 16:29:42 +00:00
Ralph Castain
3631e4e936 Revert remaining svn kruft from r23764
This commit was SVN r23786.

The following SVN revision numbers were found above:
  r23764 --> open-mpi/ompi@40a2bfa238
2010-09-22 01:11:40 +00:00
Ralph Castain
40a2bfa238 WARNING: Work on the temp branch being merged here encountered problems with bugs in subversion. Considerable effort has gone into validating the branch. However, not all conditions can be checked, so users are cautioned that it may be advisable to not update from the trunk for a few days to allow MTT to identify platform-specific issues.
This merges the branch containing the revamped build system based around converting autogen from a bash script to a Perl program. Jeff has provided emails explaining the features contained in the change.

Please note that configure requirements on components HAVE CHANGED. For example. a configure.params file is no longer required in each component directory. See Jeff's emails for an explanation.

This commit was SVN r23764.
2010-09-17 23:04:06 +00:00
Ralph Castain
e96b5f486f Reorganize the opal interface code in opal/util/if.c per prior emails and telecon discussions. Move the interface discovery code into a framework so that configuration logic can separate it out (instead of the prior #if-#else confusion).
All interface APIs for accessing the info remain unchanged in opal/util/if.c.

This has been tested on Mac, Linux, and NetBSD. Nobody else seemed interested in testing it, so there may be some future problems revealed as people try it on other OSs.

This commit was SVN r23743.
2010-09-13 01:58:51 +00:00
Ralph Castain
51833bfe6c Not -everyone- wants to ignore loopback devices. Give us a choice.
This commit was SVN r23637.
2010-08-24 02:37:05 +00:00
Jeff Squyres
245dc1a86d Add a cast to avoid a compiler warnings on BSD.
This commit was SVN r23502.
2010-07-27 14:14:37 +00:00
Jeff Squyres
0ce1a82cde This commit looks much bigger than it is. There are only 2
substantive changes in this commit; the rest are minor style changes:

 1. Change an OBJ_NEW(opal_list_item_t) to OBJ_NEW(opal_if_t).  This
    was causing memory corruption in the BSD code paths.
 1. Move some local variables from the top of opal_if_init() to inside
    the non-BSD code paths so that we avoid bunches of warnings about
    unused variables when compiling on BSD.  In doing so, I indented
    the whole non-BSD section one level deeper, making the commit look
    huge. 

I also added a few {} around 1-line blocks, added some spaces, broke a
few lines, re-formatted a few comments, ...etc.  Trivial stuff.

This commit was SVN r23501.
2010-07-27 13:46:55 +00:00
Jeff Squyres
c26dae01ce Update the if.c code to properly use the OBJ_* system.
This commit was SVN r22869.
2010-03-23 20:37:06 +00:00
Jeff Squyres
17f0885f12 Add proper BSD interface detection code. Fixes a long-standing
discussion on the users list (see
http://www.open-mpi.org/community/lists/users/2009/12/11526.php). 

Many thanks to Kevin Buckley who did most of the coding work, and to
Aleksej Saushev for his extreme patience in waiting for me to review
and commit this stuff.

This commit was SVN r22640.
2010-02-17 19:43:57 +00:00
Brian Barrett
86d8356b13 Updates to allow OMPI to build on Cray XT platforms running Catamount
This commit was SVN r22381.
2010-01-07 18:14:03 +00:00
Ralph Castain
c58a30ea10 Add two new functions:
1. check for loopback interface

2. convert tuple addresses to ip addrs + mask

This commit was SVN r22080.
2009-10-09 15:24:41 +00:00
Shiqing Fan
0e09cb650e The kernel index of the network interface wasn't set on Windows, it really caused a lot of problems.
This commit was SVN r21587.
2009-07-02 14:44:41 +00:00
Rainer Keller
221fb9dbca ... Delayed due to notifier commits earlier this day ...
- Delete unnecessary header files using
   contrib/check_unnecessary_headers.sh after applying
   patches, that include headers, being "lost" due to
   inclusion in one of the now deleted headers...

   In total 817 files are touched.
   In ompi/mpi/c/ header files are moved up into the actual c-file,
   where necessary (these are the only additional #include),
   otherwise it is only deletions of #include (apart from the above
   additions required due to notifier...)

 - To get different MCAs (OpenIB, TM, ALPS), an earlier version was
   successfully compiled (yesterday) on:
   Linux locally using intel-11, gcc-4.3.2 and gcc-SVN + warnings enabled
   Smoky cluster (x86-64 running Linux) using PGI-8.0.2 + warnings enabled
   Lens cluster (x86-64 running Linux) using Pathscale-3.2 + warnings enabled

This commit was SVN r21096.
2009-04-29 01:32:14 +00:00
Jeff Squyres
0bd9ef0bb9 Some valgrind-clean fixes. Thanks to "Number Cruncher" on the devel
list for pointing these out.

This commit was SVN r21060.
2009-04-23 18:50:46 +00:00
Nysal Jan
ab18a3629f Change the return type to handle the case where an invalid interface name is passed to this function.
This commit was SVN r20933.
2009-04-02 18:35:09 +00:00
Jeff Squyres
2815cb88b4 Fixes trac:1836: no reason to constrain the latter numbers to 2 hex
digits.  They likely shouldn't be more than 2 digits anyway, but let's
be social just in case they are (e.g.,
https://bugs.openfabrics.org/show_bug.cgi?id=1544).

This commit was SVN r20824.

The following Trac tickets were found above:
  Ticket 1836 --> https://svn.open-mpi.org/trac/ompi/ticket/1836
2009-03-18 14:43:00 +00:00