Ralph Castain
5db717f090
Some small leak cleanups
...
cmr=v1.8.3:reviewer=artpol
This commit was SVN r32358.
2014-07-30 15:46:02 +00:00
Adrian Reber
4aca7095dc
fix a syntax error in the FT code
...
This commit was SVN r32087.
2014-06-25 20:35:50 +00:00
Ralph Castain
e21bfeadcd
Now that the BTLs are moving down to OPAL and becoming available to ORTE, there no longer is a need/desire to push performance in the OOB/TCP component. So we don't need multiple modules driving NICs in parallel, and can drop all the complicated distribution logic. Fall back to the simplified single module model, but retain the ability to run that module in its own progress thread if so directed.
...
This should eliminate the connectivity issues that have been reported, and will make maintenance of this component much easier.
cmr=v1.8.2:reviewer=jsquyres:subject=simplify the OOB/TCP component
This commit was SVN r31956.
2014-06-06 02:24:17 +00:00
Ralph Castain
7df500ecf5
Break the loop caused by retrying to send a message to a hop that is unknown by the TCP oob component. We attempt to provide a way for other components to try, but need to mark that the TCP component is not able to reach that process so the OOB base will know to give up.
...
This commit was SVN r31928.
2014-06-02 15:00:33 +00:00
Nathan Hjelm
59d09ad9de
orte: fix several small memory leaks
...
grpcomm: fix memory leaks
We were leaking the caddy object used to pass data to the callback
function. This commit fixes these leaks.
oob,rml: fix memory leaks
This commit fixes several leaks:
- Both the oob/base and oob/tcp were leaking objects on their peer
hash tables. Iterate on the hash tables and free any objects.
- Leaked sent messages because of missing OBJ_RELEASE. I placed the
release in ORTE_RML_SEND_COMPLETE to catch all the possible
paths.
ess/base: close the state framework
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31776.
2014-05-15 15:06:27 +00:00
Ralph Castain
ad0e8f841d
Just pick a module to handle the incoming connection if no direct interface is identified. Siegmar hit it because his IP/netmask is disjoint, but a router was able to make the connection.
...
Refs trac:4627
This commit was SVN r31763.
The following Trac tickets were found above:
Ticket 4627 --> https://svn.open-mpi.org/trac/ompi/ticket/4627
2014-05-14 19:23:02 +00:00
Ralph Castain
e605e73379
Close the incoming socket if we aren't going to accept it
...
cmr=v1.8.2:reviewer=rhc
This commit was SVN r31759.
2014-05-14 16:51:59 +00:00
Ralph Castain
5602156a1c
Use the correct abstraction layer name for the data dirs
...
This commit was SVN r31684.
2014-05-08 14:32:24 +00:00
Ralph Castain
11faab1091
The final step of the RFC: convert the <foo>libdir and friends to fit their respective code areas, and equate them all at the top. Note that we can't entirely separate things as the opal_install_dirs framework can't handle separated locations for the various trees.
...
This commit was SVN r31679.
2014-05-08 02:01:35 +00:00
Ralph Castain
445b552d3a
Try again to get an error message printed when a daemon fails to successfully report back to mpirun. In this case, there is no guaranteed way for the daemon to output the error report itself - we don't have a connection back to the HNP, and we have tied stderr off to /dev/null (for good reasons). So the HNP has to detect the failure itself and report it.
...
The HNP can't know the precise reason, of course - all it knows is that the daemon failed. So output a generic error message that provides guidance on probable causes.
Refs trac:4571
This commit was SVN r31589.
The following Trac tickets were found above:
Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-05-01 19:48:21 +00:00
Ralph Castain
c4c9bc1573
As per the RFC:
...
http://www.open-mpi.org/community/lists/devel/2014/04/14496.php
Revamp the opal database framework, including renaming it to "dstore" to reflect that it isn't a "database". Move the "db" framework to ORTE for now, soon to move to ORCM
This commit was SVN r31557.
2014-04-29 21:49:23 +00:00
Ralph Castain
3723b39f30
Ensure we don't silently fail when unable to make a connection - bark pleasantly first.
...
Refs trac:4571
This commit was SVN r31537.
The following Trac tickets were found above:
Ticket 4571 --> https://svn.open-mpi.org/trac/ompi/ticket/4571
2014-04-28 19:16:32 +00:00
Ralph Castain
d642babff6
Derived from a patch provided by Artem: clean up the "abnormal" code path for selecting TCP OOB modules to connect to a remote process. If we can't find a direct interface-to-address match, then assign all the provided addresses to the first available TCP module and let the normal failure process determine if the remote proc is truly reachable.
...
cmr=v1.8.2:reviewer=artpol:subject=fix abnormal code connection path in tcp oob
This commit was SVN r31536.
2014-04-28 19:05:14 +00:00
Jeff Squyres
d8715f1e3a
Close 3 more fd's that were leaking into child processes.
...
Child processes now look clean; I can't find any more fd's that are
leaking from the parent to children.
Refs trac:4550
This commit was SVN r31515.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 15:36:24 +00:00
Jeff Squyres
e1655ae68d
opal/util/fd.c: add new convenience function for setting FD_CLOEXEC
...
Paul Hargrove pointed out that Stevens tells us that we should
F_GETFD before F_SETFD. And so we shall.
Make a new convenience function to do this (opal_fd_set_cloexec()),
just so that we don't have to litter this 2-step process throughout
the code.
Refs trac:4550
This commit was SVN r31513.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-24 13:04:49 +00:00
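The two-step pattern described in the opal/util/fd.c entry above can be sketched roughly as follows. This is a minimal illustration, not the actual opal_fd_set_cloexec() source; the name set_cloexec is hypothetical. It uses fcntl() with F_GETFD and F_SETFD, the POSIX commands that read and write descriptor flags such as FD_CLOEXEC.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not the Open MPI implementation):
 * read the current descriptor flags first, then write them back
 * with FD_CLOEXEC added, so any other descriptor flags are preserved. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD, 0);
    if (flags < 0) {
        return -1;              /* invalid descriptor or other error */
    }
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0) {
        return -1;
    }
    return 0;
}
```

Wrapping both calls in one function means callers cannot accidentally skip the read step and overwrite existing flags.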
Jeff Squyres
410f5bfb91
oob_tcp_listener.c: set both ends of this pipe to be close-on-exec
...
This pipe is used to communicate between threads in this process.
Mark both fds as close-on-exec so that children don't inherit this
pipe.
Refs trac:4550
This commit was SVN r31512.
The following Trac tickets were found above:
Ticket 4550 --> https://svn.open-mpi.org/trac/ompi/ticket/4550
2014-04-23 21:46:41 +00:00
Ralph Castain
bbdbc5f8a8
Per suggestion from George, use a pipe for terminating the thread.
...
Refs trac:4510
This commit was SVN r31381.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-14 01:02:46 +00:00
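The pipe-based termination mentioned in the entry above is a common pattern, and one possible shape is sketched here. This is an assumption about the general technique, not the code in oob_tcp_listener.c; the function names wait_for_work_or_stop and request_stop are hypothetical. The listener thread includes the read end of a pipe in its select() set, so writing one byte to the write end wakes it immediately instead of waiting for a timeout.

```c
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical wait step for a listener thread.
 * Blocks until either the stop pipe or the listening socket is readable.
 * Returns 1 if a stop was requested, 0 if listen_fd is readable,
 * and -1 on a select() error. Pass listen_fd = -1 to watch only the pipe. */
int wait_for_work_or_stop(int listen_fd, int stop_fd)
{
    for (;;) {
        fd_set readfds;
        int maxfd = stop_fd;

        FD_ZERO(&readfds);
        FD_SET(stop_fd, &readfds);
        if (listen_fd >= 0) {
            FD_SET(listen_fd, &readfds);
            if (listen_fd > maxfd) {
                maxfd = listen_fd;
            }
        }

        /* No timeout is needed: the stop pipe wakes us on demand. */
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted by a signal, retry */
            }
            return -1;
        }
        /* A stop request takes priority over pending connections. */
        return FD_ISSET(stop_fd, &readfds) ? 1 : 0;
    }
}

/* Called from the main thread: one byte is enough to wake select(). */
int request_stop(int stop_write_fd)
{
    char byte = 0;
    return write(stop_write_fd, &byte, 1) == 1 ? 0 : -1;
}
```

In a real listener the main thread would call request_stop() and then join the thread; the thread itself would loop on wait_for_work_or_stop() and call accept() whenever it returns 0.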
Ralph Castain
2d8dff837c
Ensure we properly terminate the listening thread prior to exiting, but do so in a way that doesn't make us wait for select to timeout.
...
Refs trac:4510
This commit was SVN r31376.
The following Trac tickets were found above:
Ticket 4510 --> https://svn.open-mpi.org/trac/ompi/ticket/4510
2014-04-12 15:01:24 +00:00
Ralph Castain
9b30b2b783
Shave some time off of mpirun's operation by not waiting for the listener thread to terminate before exiting
...
cmr=v1.8.1:reviewer=rhc
This commit was SVN r31368.
2014-04-11 04:16:28 +00:00
Dave Goodell
5f3b81e291
oob: delete events when destroying a peer
...
Without this patch running ring_c with the usnic BTL under valgrind will
cause the orteds to segfault.
Reviewed-by: Jeff Squyres <jsquyres@cisco.com>
Reviewed-by: Ralph Castain <rhc@open-mpi.org>
cmr=v1.7.5:reviewer=ompi-rm1.7
This commit was SVN r31161.
2014-03-19 22:15:49 +00:00
Ralph Castain
2abed09d7c
Continue to resolve priority issues. Clean up the case of forced termination in mpirun during launch processing by ensuring we can respond to socket closures, and ensuring that the remote daemons correctly close their sockets when terminating.
...
Jeff: please test a variety of conditions to ensure we get this right
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r31058.
2014-03-13 04:02:24 +00:00
Ralph Castain
a254d2db34
Silence warning when CR is not enabled
...
This commit was SVN r31025.
2014-03-12 13:47:03 +00:00
Adrian Reber
4512b3375e
OOB/TCP: wire up the existing ft_event() function
...
This commit was SVN r31022.
2014-03-12 12:47:20 +00:00
Ralph Castain
9793909988
Correct the constant we check for an error. Thanks to George for noticing it.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30949.
2014-03-06 02:21:27 +00:00
Ralph Castain
da4cb39683
If we can't find a route to communicate, emit an error message rather than just exiting with a non-zero status
...
cmr=v1.7.5:reviewer=jsquyres:subject=print error if cannot communicate
This commit was SVN r30922.
2014-03-04 04:57:53 +00:00
Ralph Castain
5520d6971b
We do have to track the origin of messages sent over usock as the daemon does route them back down, and we need to get the "sender" info correct. Also do a better job of dealing with simultaneous connections to avoid binding to a used socket.
...
Refs trac:4280
This commit was SVN r30781.
The following Trac tickets were found above:
Ticket 4280 --> https://svn.open-mpi.org/trac/ompi/ticket/4280
2014-02-20 17:27:05 +00:00
Adrian Reber
6b45d475e9
Fix compiler warnings when compiling with --with-ft
...
With fault tolerance enabled, different functions
are selected during compilation. Most of the ft
code is #ifdef'd out. This commit #ifdef's out more
code so that compiler warnings like
warning: unused variable 'item' [-Wunused-variable]
opal_list_item_t *item;
are removed.
This commit was SVN r30747.
2014-02-17 10:53:44 +00:00
Ralph Castain
14bb7a117c
Fix bugs in the oob base - ensure we get the components in high-to-low priority, and that we correctly track reachability via all components. Adjust the priority of the tcp component to leave headroom for others
...
Refs trac:267
This commit was SVN r30740.
The following Trac tickets were found above:
Ticket 267 --> https://svn.open-mpi.org/trac/ompi/ticket/267
2014-02-16 03:19:08 +00:00
Ralph Castain
3f9db36e0d
Make Jeff smile - pretty-up the indentation
...
Refs trac:4267
This commit was SVN r30733.
The following Trac tickets were found above:
Ticket 4267 --> https://svn.open-mpi.org/trac/ompi/ticket/4267
2014-02-14 23:25:48 +00:00
Ralph Castain
4e1c07cbf2
If we are given a TCP oob address that doesn't match any active module, it is still possible that we could route to the address if a router is in the system. No harm in trying, so arbitrarily pick the first connection in the active module list and assign the peer to it. If that module can't reach it, we'll follow the usual failover mechanism until finally concluding that nobody can get there.
...
cmr=v1.7.5:reviewer=jsquyres:subject=handle non-matching addresses
This commit was SVN r30719.
2014-02-13 23:37:22 +00:00
Ralph Castain
fc6101b508
Handle "localhost" better
...
Refs trac:4263
This commit was SVN r30702.
The following Trac tickets were found above:
Ticket 4263 --> https://svn.open-mpi.org/trac/ompi/ticket/4263
2014-02-12 20:30:39 +00:00
Ralph Castain
a8a9801a0b
Ensure an orted exits with non-zero status if it is unable to send a message. Add more diagnostic messages to the OOB set_addr code
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30701.
2014-02-12 19:44:01 +00:00
Ralph Castain
fa7b686ccc
Provide better messages when we don't find any included interfaces, and/or don't find any interfaces for use by OOB.
...
cmr=v1.7.5:reviewer=jsquyres
This commit was SVN r30675.
2014-02-11 19:29:03 +00:00
Ralph Castain
230336b6a8
Upgrade the security framework to avoid multiple hits against the global security server. Add support for a future case where mpirun assigns a global security credential for a given run, though we need to work out how to handle connect-accept from other mpiruns in that case. Remove a bunch of duplicate code in the OOB by consolidating the connection handshake code.
...
Refs trac:4221
This commit was SVN r30554.
The following Trac tickets were found above:
Ticket 4221 --> https://svn.open-mpi.org/trac/ompi/ticket/4221
2014-02-04 14:47:04 +00:00
Ralph Castain
5980b7e042
Add a security framework for authenticating connections - we will add LDAP, Kerberos, and Keystone support in the next month. For now, just put a placeholder "basic" module that does the minimum.
...
Wire the security check into ORTE's OOB handshake, and add a "version" check to ensure that both ends are from the same ORTE version. If not, report the mismatch and refuse the connection
Fixes trac:4171
cmr=v1.7.5:reviewer=jsquyres:subject=Add a security framework for authenticating connections
This commit was SVN r30551.
The following Trac tickets were found above:
Ticket 4171 --> https://svn.open-mpi.org/trac/ompi/ticket/4171
2014-02-04 01:38:45 +00:00
Ralph Castain
993198cfba
Fix lost message problem - if multiple messages were queued before the connection was formed, we lost all but the first one. Ensure that all messages get properly queued prior to completing the connection
...
cmr=v1.7.4:reviewer=jsquyres:subject=Fix lost message problem
This commit was SVN r30516.
2014-01-31 05:30:51 +00:00
Rolf vandeVaart
f7055de78e
Stop listening thread and wait for it to terminate.
...
This commit was SVN r30507.
2014-01-30 20:37:15 +00:00
Ralph Castain
db92ac3ce1
Cleanup role of aggregator relative to daemons
...
Refs trac:4176
This commit was SVN r30495.
The following Trac tickets were found above:
Ticket 4176 --> https://svn.open-mpi.org/trac/ompi/ticket/4176
2014-01-30 00:53:30 +00:00
Ralph Castain
956aab03a7
Track the origin of a message so it can be passed across transports
...
Refs trac:4184
This commit was SVN r30433.
The following Trac tickets were found above:
Ticket 4184 --> https://svn.open-mpi.org/trac/ompi/ticket/4184
2014-01-26 21:09:26 +00:00
Ralph Castain
657796f9e0
Revert r30327 - turns out it isn't quite right just yet. :-(
...
Closes trac:4138
This commit was SVN r30328.
The following SVN revision numbers were found above:
r30327 --> open-mpi/ompi@87d5f86025
The following Trac tickets were found above:
Ticket 4138 --> https://svn.open-mpi.org/trac/ompi/ticket/4138
2014-01-18 23:38:39 +00:00
Ralph Castain
87d5f86025
Enable use of unix domain sockets for local OOB communications, thereby removing the requirement for an active network interface when running strictly on a single node. Update the overall OOB system to support cross-transport movement of messages so that the OOB can move a received message to another transport for transmission.
...
cmr=v1.7.5:reviewer=jsquyres:subject=Enable use of unix domain sockets for local OOB communications
This commit was SVN r30327.
2014-01-18 21:36:49 +00:00
Brian Barrett
8b778903d8
Fix longstanding issue with our multi-project support. Rather than using
...
pkg{data,lib,includedir}, use our own ompi{data,lib,includedir}, which is
always set to {datadir,libdir,includedir}/openmpi. This will keep us from
having help files in prefix/share/open-rte when building without Open MPI,
but in prefix/share/openmpi when building with Open MPI.
This commit was SVN r30140.
2014-01-07 22:11:15 +00:00
Ralph Castain
85f2429819
Ensure the ipv6 lists get initialized and finalized
...
cmr=v1.7.4:reviewer=jsquyres
This commit was SVN r30081.
2013-12-24 17:24:39 +00:00
Ralph Castain
01ee5f380b
Remove debug - problem has been identified
...
Refs trac:4026
This commit was SVN r30075.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 15:22:18 +00:00
Jeff Squyres
ce02002a5e
Free minor memory leak / squash valgrind still-reachable warning.
...
cmr=v1.7.5:reviewer=rhc
This commit was SVN r30071.
2013-12-24 11:04:38 +00:00
Ralph Castain
38f46641ce
Ensure the recv handler has been initialized
...
Refs trac:4026
This commit was SVN r30068.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-24 06:09:45 +00:00
Ralph Castain
65228d3571
Don't use "size_t" for the nbytes field in the header - use uint32_t to ensure that ntohl/htonl correctly match it
...
Refs trac:4026
This commit was SVN r30062.
The following Trac tickets were found above:
Ticket 4026 --> https://svn.open-mpi.org/trac/ompi/ticket/4026
2013-12-23 21:39:49 +00:00
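The reasoning behind the size_t to uint32_t change above can be illustrated with a small sketch. The struct and function names here (msg_hdr, hdr_hton, hdr_ntoh) and the tag field are hypothetical and are not taken from the OOB/TCP header definition. The point is that htonl() and ntohl() operate on uint32_t, while size_t is commonly 8 bytes on 64-bit platforms, so a fixed 32-bit field keeps the conversion and the wire layout in agreement.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical wire header: every field has a fixed 32-bit width
 * so that it matches what htonl()/ntohl() convert. */
struct msg_hdr {
    uint32_t tag;
    uint32_t nbytes;   /* payload length; uint32_t rather than size_t */
};

/* Convert all fields to network byte order before sending. */
void hdr_hton(struct msg_hdr *h)
{
    h->tag = htonl(h->tag);
    h->nbytes = htonl(h->nbytes);
}

/* Convert all fields back to host byte order after receiving. */
void hdr_ntoh(struct msg_hdr *h)
{
    h->tag = ntohl(h->tag);
    h->nbytes = ntohl(h->nbytes);
}
```

With a size_t field, the value passed to htonl() would be narrowed to 32 bits on 64-bit hosts while the field itself still occupied 8 bytes on the wire, so sender and receiver could disagree about the header contents.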
Ralph Castain
7d8c0459a4
Attempt to debug hang that is hitting some environments. Posting to 1.7.4 as a placeholder for the eventual solution
...
cmr=v1.7.4:reviewer=rhc
This commit was SVN r30060.
2013-12-23 19:57:05 +00:00
George Bosilca
24879f9def
Code cleanup while chasing valgrind complaints.
...
This commit was SVN r30048.
2013-12-21 23:28:14 +00:00
Ralph Castain
264150872b
Add a bunch of debug output to the OOB connection completion code so we can track down a handshake problem. Available in optimized builds as well as debug ones by setting -mca oob_base_verbose 10
...
No review will be required as this is just debug code for those helping us debug the 1.7.4 release candidates
cmr=v1.7.4:reviewer=ompi-gk1.7
This commit was SVN r30043.
2013-12-21 16:09:26 +00:00