the multiple threads accessing the OOB/registry asynchronously via the
callbacks. The quickest solution (but definitely not the cleanest) is
to serialize these callbacks in such a way that at any given time
only one thread can execute a callback.
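A minimal sketch of the serialization idea, assuming POSIX threads. The names
callback_fn, callback_lock, run_serialized_callback and count_callback are
hypothetical and are not taken from the ORTE code.

```c
#include <pthread.h>

/* Hypothetical callback signature; not the real OOB/registry type. */
typedef void (*callback_fn)(void *cbdata);

/* One process-wide lock shared by every callback invocation. */
static pthread_mutex_t callback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Run cb(cbdata) while holding the global lock, so at most one
 * thread is inside a callback at any given time. */
void run_serialized_callback(callback_fn cb, void *cbdata)
{
    pthread_mutex_lock(&callback_lock);
    cb(cbdata);
    pthread_mutex_unlock(&callback_lock);
}

/* Example callback: increments the int that cbdata points to. */
void count_callback(void *cbdata)
{
    ++*(int *)cbdata;
}
```

With a plain (non-recursive) mutex like this one, a callback that triggers
another serialized callback from inside itself would deadlock, which is one
reason this approach is quick rather than clean.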
This commit was SVN r15086.
through the win dll using multiple threads, we have to ensure that
the oob callbacks happen only in a synchronous way or really bad
things happen with the current design (blocking messages from a receive
callback).
This commit was SVN r15069.
single threaded builds. In its default configuration, all this does
is ensure that there's at least a good chance of threads building
based on non-threaded development (since the variable names will be
checked). There is also code to make sure that a "mutex" is never
"double locked" when using the conditional macro mutex operations.
This is off by default because there are a number of places in both
ORTE and OMPI where this alarm spews megabytes of errors on a
simple test. So we have some work to do on our path towards
thread support.
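A minimal sketch of how a double-lock check can work in a single-threaded
build, where no real lock is taken. The debug_mutex_t type and the
DEBUG_MUTEX_LOCK macro are hypothetical names, not the OPAL ones.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical debug mutex: no real lock is taken, but the flag
 * still records whether the mutex is currently "held". */
typedef struct {
    bool locked;
    int  double_lock_errors;
} debug_mutex_t;

/* Returns 0 on success, -1 if the mutex was already locked. */
int debug_mutex_lock(debug_mutex_t *m, const char *file, int line)
{
    if (m->locked) {
        m->double_lock_errors++;
        fprintf(stderr, "%s:%d: mutex double locked\n", file, line);
        return -1;
    }
    m->locked = true;
    return 0;
}

void debug_mutex_unlock(debug_mutex_t *m)
{
    m->locked = false;
}

/* The macro form records the call site for the error message. */
#define DEBUG_MUTEX_LOCK(m) debug_mutex_lock((m), __FILE__, __LINE__)
```

Because the check only needs a flag, it can run in non-threaded development
builds and still catch lock misuse that would matter once threads are enabled.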
Also removed the macro versions of the non-conditional thread locks,
because in the only places they were used, the author of the code intended
to use the conditional thread locks. So now you have upper-case
macros for conditional thread locks and lowercase functions for
non-conditional locks. Simple, right? :).
This commit was SVN r15011.
1. generalize orte_rml.xcast to become a general broadcast-like messaging system. Messages can now be sent to any tag on the daemons or processes. Note that any message sent via xcast will be delivered to ALL processes in the specified job - you don't get to pick and choose. At a later date, we will introduce an augmented capability that will use the daemons as relays, but will allow you to send to a specified array of process names.
2. extended orte_rml.xcast so it supports more scalable message routing methodologies. At the moment, we support three: (a) direct, which sends the message directly to all recipients; (b) linear, which sends the message to the local daemon on each node, which then relays it to its own local procs; and (c) binomial, which sends the message via a binomial algo across all the daemons, each of which then relays to its own local procs. The crossover points between the algos are adjustable via MCA param, or you can simply demand that a specific algo be used.
3. orteds no longer exhibit two types of behavior: bootproxy or VM. Orteds now always behave like they are part of a virtual machine - they simply launch a job if mpirun tells them to do so. This is another step towards creating an "orteboot" functionality, but also provided a clean system for supporting message relaying.
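The binomial routing mentioned in item 2 can be sketched as follows. This is
the textbook binomial tree rooted at rank 0; the commit does not say which
layout ORTE uses, so the numbering scheme and the function names
binomial_children and binomial_parent are assumptions for illustration only.

```c
/* Fill children[] with the ranks that `rank` relays to in a binomial
 * tree rooted at rank 0 over `size` daemons, and return the count.
 * children[] needs room for one entry per bit of int (32 is ample).
 * Assumes size is well below INT_MAX so the shifts cannot overflow. */
int binomial_children(int rank, int size, int *children)
{
    int count = 0;
    int bit = 1;

    /* Skip past the highest set bit of rank. */
    while (bit <= rank) {
        bit <<= 1;
    }
    /* Every higher bit yields one child, as long as that rank exists. */
    for (; rank + bit < size; bit <<= 1) {
        children[count++] = rank + bit;
    }
    return count;
}

/* Parent of `rank` in the same tree, or -1 for the root. */
int binomial_parent(int rank)
{
    int bit = 1;

    if (rank == 0) {
        return -1;
    }
    /* Find the highest set bit of rank, then clear it. */
    while ((bit << 1) <= rank) {
        bit <<= 1;
    }
    return rank - bit;
}
```

With 8 daemons, rank 0 relays to 1, 2 and 4, rank 1 relays to 3 and 5, and so
on, so every daemon is reached in at most three relay steps.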
Note one major impact of this commit: multiple daemons on a node cannot be supported any longer! Only a single daemon/node is now allowed.
This commit is known to break support for the following environments: POE, Xgrid, Xcpu, Windows. It has been tested on rsh, SLURM, and Bproc. Modifications for TM support have been made but could not be verified due to machine problems at LANL. Modifications for SGE have been made but could not be verified. The developers for the non-verified environments will be separately notified along with suggestions on how to fix the problems.
This commit was SVN r15007.
structures in the system. Instead of using memcmp, use the ns function.
Using memcmp() won't cause a problem as long as all three elements of the name are
ints, but if they have different sizes, alignment and padding rules
can cause memcmp() to compare padding space, which rarely holds a sane
value.
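A minimal sketch of the padding problem and of a field-by-field compare. The
struct layout and the names example_name_t and example_name_compare are
hypothetical; the real name type and ns compare function are not shown in this
message.

```c
#include <stdint.h>

/* Hypothetical three-field name with mixed field sizes. The compiler
 * may insert padding after `cell`, and memcmp() would compare it. */
typedef struct {
    uint8_t  cell;
    uint16_t job;
    uint32_t vpid;
} example_name_t;

/* Compare field by field so padding bytes never take part.
 * Returns -1, 0 or 1, like memcmp() but on the fields only. */
int example_name_compare(const example_name_t *a, const example_name_t *b)
{
    if (a->cell != b->cell) return (a->cell < b->cell) ? -1 : 1;
    if (a->job  != b->job)  return (a->job  < b->job)  ? -1 : 1;
    if (a->vpid != b->vpid) return (a->vpid < b->vpid) ? -1 : 1;
    return 0;
}
```

Two names with identical fields compare equal here regardless of what the
padding bytes happen to contain.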
This commit was SVN r14998.
generalized component include/exclude infrastructure. This commit
removes the oob_base_include and oob_base_exclude MCA params because
they have long-since been handled by the "oob" MCA parameter in the
MCA base.
This commit was SVN r14979.
Rename the oob_tcp_include and oob_tcp_exclude MCA parameters to be
oob_tcp_if_include and oob_tcp_if_exclude (to match the convention
with btl_tcp_if_[in|ex]clude). Keep "hidden" synonyms oob_tcp_include
and oob_tcp_exclude in case anyone is actually using them (and some
users undoubtedly are), but do not have them show up in ompi_info
--param output. Instead, the new "oob_tcp_if_*" names will show up in
ompi_info output.
This commit was SVN r14746.
The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
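A minimal sketch of the fail-fast behavior described above. It uses a small
linear table in place of the OOB's hash tables, and every name in it
(contact_add, example_send, EXAMPLE_ADDRESSEE_UNKNOWN) is hypothetical rather
than the real OOB API.

```c
#include <stddef.h>
#include <stdio.h>

#define EXAMPLE_SUCCESS             0
#define EXAMPLE_ADDRESSEE_UNKNOWN (-1)   /* hypothetical error value */
#define MAX_CONTACTS               16

typedef struct {
    int  peer_id;
    char uri[64];
} contact_t;

/* Stand-in for the hash table of known contact info. */
static contact_t contacts[MAX_CONTACTS];
static int num_contacts = 0;

/* Record contact info for a peer; returns 0, or -1 if the table is full. */
int contact_add(int peer_id, const char *uri)
{
    if (num_contacts >= MAX_CONTACTS) {
        return -1;
    }
    contacts[num_contacts].peer_id = peer_id;
    snprintf(contacts[num_contacts].uri,
             sizeof contacts[num_contacts].uri, "%s", uri);
    num_contacts++;
    return 0;
}

static const contact_t *contact_lookup(int peer_id)
{
    for (int i = 0; i < num_contacts; i++) {
        if (contacts[i].peer_id == peer_id) {
            return &contacts[i];
        }
    }
    return NULL;
}

/* Fail immediately instead of waiting for the peer to report in. */
int example_send(int peer_id, const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    if (NULL == contact_lookup(peer_id)) {
        return EXAMPLE_ADDRESSEE_UNKNOWN;
    }
    /* A real implementation would queue or transmit the message here. */
    return EXAMPLE_SUCCESS;
}
```

The caller, not the messaging layer, now decides whether to retry, wait, or
give up when the recipient is unknown.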
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
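A minimal sketch of an atomic arithmetic update, assuming a single integer
value guarded by a POSIX mutex. The registry_slot_t type, the arith_op_t enum
and registry_arith are hypothetical names; the real GPR "arith" signature is
not given in this message.

```c
#include <pthread.h>

typedef enum { ARITH_ADD, ARITH_SUB, ARITH_MUL, ARITH_DIV } arith_op_t;

/* Hypothetical registry location holding one integer value. */
typedef struct {
    pthread_mutex_t lock;
    int             value;
} registry_slot_t;

/* Apply `op` with `operand` to slot->value as one locked step, so no
 * other thread can slip in between the read and the write.
 * Returns 0 on success, -1 on division by zero or an unknown op
 * (the value is left unchanged in both error cases). */
int registry_arith(registry_slot_t *slot, arith_op_t op, int operand)
{
    int rc = 0;

    pthread_mutex_lock(&slot->lock);
    switch (op) {
    case ARITH_ADD: slot->value += operand; break;
    case ARITH_SUB: slot->value -= operand; break;
    case ARITH_MUL: slot->value *= operand; break;
    case ARITH_DIV:
        if (operand == 0) {
            rc = -1;
        } else {
            slot->value /= operand;
        }
        break;
    default:
        rc = -1;
        break;
    }
    pthread_mutex_unlock(&slot->lock);
    return rc;
}
```

Doing the get, modify and put under one lock is what separates this from three
independent calls, where two callers could both read the old value.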
This commit was SVN r14711.
* Move ipv6compat.h code into opal_config_bottom.h and change into some
more intelligent testing of structures
* Change opal's if interface to use sockaddr instead of sockaddr_storage,
as the RFCs suggest we do
* Move the networking code in opal that isn't directly related to if
detection into net.h
* Add a quick function to get the port out of either a sockaddr_in
or sockaddr_in6, saving a bunch of code in the oob.
* Update TCP oob and btl with new interface
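A minimal sketch of the port helper described in the list above, using only
standard sockets calls. The name sockaddr_get_port is hypothetical and is not
the name used in opal.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Return the port in host byte order for an IPv4 or IPv6 address,
 * or -1 if the address family is not supported. */
int sockaddr_get_port(const struct sockaddr *addr)
{
    switch (addr->sa_family) {
    case AF_INET:
        return ntohs(((const struct sockaddr_in *)addr)->sin_port);
    case AF_INET6:
        return ntohs(((const struct sockaddr_in6 *)addr)->sin6_port);
    default:
        return -1;
    }
}
```

Callers can then pass any sockaddr pointer without repeating the family switch
at each call site.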
This commit was SVN r14679.
assumptions in the FT restart code for the ORTE layer.
This fixes those problems by having the RML completely shutdown and
restart the OOB framework (instead of just the module as before).
This makes it much easier to manage and maintain as the OOB
changes in the future.
The SDS now does communication as part of its startup procedure, so
we need to make sure we restart the RML before the SDS so that it can
communicate properly.
OOB base [close|open] used a static bool to determine if they have
been called previously or not. I needed to expose this boolean so
that I can close() then open() the oob base in the restart procedure.
The functionality has not changed, we just now have the ability to
open/close the framework as many times as we need to as long as we
always call them in that order. (So calling open twice in a row is still not
allowed; a second open() is only allowed after a close().)
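A minimal sketch of the open/close guard described above. The function names
are hypothetical, and returning an error on a repeated open() or close() is an
assumption; the message does not say how the real OOB base reacts.

```c
#include <stdbool.h>

/* Tracks whether the framework is currently open. */
static bool framework_opened = false;

/* Returns 0 on success, -1 if the framework is already open. */
int example_framework_open(void)
{
    if (framework_opened) {
        return -1;
    }
    framework_opened = true;
    return 0;
}

/* Returns 0 on success, -1 if the framework is not open. */
int example_framework_close(void)
{
    if (!framework_opened) {
        return -1;
    }
    framework_opened = false;
    return 0;
}

/* Restart is close() followed by open(), always in that order. */
int example_framework_restart(void)
{
    if (0 != example_framework_close()) {
        return -1;
    }
    return example_framework_open();
}
```

Because close() resets the flag, the open/close pair can be repeated any
number of times during restart.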
Things seem to be working now.
This commit was SVN r14515.
- make opal_sockaddr2str() take a sockaddr_storage instead of a sockaddr_in6
so that it works for IPv4 and IPv6 addresses, and remove a whole bunch
of #ifs in the OOB code.
- Fix a compiler warning in the TCP BTL due to run-time determined
array size by making it a dynamically allocated array.
- Fix the unpacking code of IPv4 addresses when using IPv6 support, so
that the address is in the correct location (instead of in an IPv6
structure, use an IPv4 structure). Refs trac:1005.
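A minimal sketch of formatting either address family from a sockaddr_storage,
in the spirit of the first item above. It relies on the standard inet_ntop()
call; the name sockaddr_storage_to_str is hypothetical and this is not the
body of opal_sockaddr2str().

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Write the textual form of the address in `ss` into `buf`.
 * Returns buf on success, NULL on an unsupported family or if buf
 * is too small. INET6_ADDRSTRLEN bytes are enough for both families. */
const char *sockaddr_storage_to_str(const struct sockaddr_storage *ss,
                                    char *buf, socklen_t len)
{
    switch (ss->ss_family) {
    case AF_INET:
        return inet_ntop(AF_INET,
                         &((const struct sockaddr_in *)ss)->sin_addr,
                         buf, len);
    case AF_INET6:
        return inet_ntop(AF_INET6,
                         &((const struct sockaddr_in6 *)ss)->sin6_addr,
                         buf, len);
    default:
        return NULL;
    }
}
```

Taking the storage type lets one function serve both families, which is what
removes the per-family #if blocks at the call sites.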
This commit was SVN r14514.
The following Trac tickets were found above:
Ticket 1005 --> https://svn.open-mpi.org/trac/ompi/ticket/1005
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).
This also involved a slight change to the oob.xcast API, so propagated that as required.
Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)
Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.
This commit was SVN r14475.
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search path is checking
$pkglibdir).
This commit was SVN r14345.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r14289
* Remove the connect() timeout code, as it had some nasty race conditions
when connections were established as the trigger was firing. A better
solution has been found for the cluster where this was needed, so just
removing it was easiest.
* When a fatal error (too many connection failures) occurs, set an error
on messages in the queue even if there isn't an active message. The
first message to any peer will be queued without being active (and
so will all subsequent messages until the connection is established),
and the orteds will hang until that first message completes. So if
an orted can never contact its peer, it will never exit and just sit
waiting for that message to complete.
* Cover an interesting RST condition in the connect code. A connection
can complete the three-way handshake, the connector can even send
some data, but the server side will drop the connection because it
can't move it from the half-connected to fully-connected state because
of space shortage in the listen backlog queue. This causes a RST to
be received the first time that recv() is called, which will be when waiting
for the remote side of the OOB ack. In this case, transition the
connection back into a CLOSED state and try to connect again.
* Add levels of debugging, rather than all or nothing, each building on
the previous level. 0 (default) is hard errors. 1 is connection
error debugging info. 2 is all connection info. 3 is more state
info. 4 includes all message info.
* Add some hopefully useful comments
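A minimal sketch of cumulative debug levels like the ones listed above, where
each level also prints everything from the levels below it. The enum names and
the functions debug_set_level, debug_enabled and debug_output are hypothetical,
not the TCP OOB's own.

```c
#include <stdarg.h>
#include <stdio.h>

/* Levels mirror the list above; a higher level includes all lower ones. */
enum {
    DBG_ERROR        = 0,   /* hard errors (default) */
    DBG_CONNECT_FAIL = 1,   /* connection error debugging info */
    DBG_CONNECT      = 2,   /* all connection info */
    DBG_STATE        = 3,   /* more state info */
    DBG_MESSAGE      = 4    /* all message info */
};

static int debug_level = DBG_ERROR;

void debug_set_level(int level)
{
    debug_level = level;
}

/* A message is shown when its level is at or below the configured one. */
int debug_enabled(int level)
{
    return level <= debug_level;
}

/* printf-style output to stderr, filtered by level. */
void debug_output(int level, const char *fmt, ...)
{
    va_list ap;

    if (!debug_enabled(level)) {
        return;
    }
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
}
```

Setting the level to 2, for example, shows hard errors and all connection
messages while keeping state and per-message output quiet.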
This commit was SVN r14261.
OPAL_FREE_LIST_WAIT/RETURN will not use locks in a non-threaded build.
Conditionally using locks around OPAL_FREE_LIST_WAIT/RETURN when non-threaded
seems to fix the issue.
Tested at 4K processes and seems to work.
This commit was SVN r14135.
listen thread, but we're not the HNP. This is better than not starting up
any listen mode, which is what we were doing before :/
This commit was SVN r14133.
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.
This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.
This commit closes trac:158
More details to follow.
This commit was SVN r14051.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r13912
The following Trac tickets were found above:
Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
it's bigger than the timeout for the connect() call, just don't register
the handler by default and fall back to connect() timing out. Should give
much happier performance on big clusters.
This commit was SVN r13639.