The primary change that underlies all this is in the OOB. Specifically, the problem in the code until now has been that the OOB attempts to resolve an address when we call the "send" to an unknown recipient. The OOB would then wait forever if that recipient never actually started (and hence, never reported back its OOB contact info). In the case of an orted that failed to start, we would correctly detect that the orted hadn't started, but then we would attempt to order all orteds (including the one that failed to start) to die. This would cause the OOB to "hang" the system.
Unfortunately, revising how the OOB resolves addresses introduced a number of additional problems. Specifically, and most troublesome, was the fact that comm_spawn involved the immediate transmission of the rendezvous point from parent-to-child after the child was spawned. The current code used the OOB address resolution as a "barrier" - basically, the parent would attempt to send the info to the child, and then "hold" there until the child's contact info had arrived (meaning the child had started) and the send could be completed.
Note that this also caused comm_spawn to "hang" the entire system if the child never started... The app-failed-to-start helped improve that behavior - this code provides additional relief.
With this change, the OOB will return an ADDRESSEE_UNKNOWN error if you attempt to send to a recipient whose contact info isn't already in the OOB's hash tables. To resolve comm_spawn issues, we also now force the cross-sharing of connection info between parent and child jobs during spawn.
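The fail-fast behavior described above can be sketched roughly as follows. This is only an illustration: the `sketch_*` names, the peer table layout, and the error values are hypothetical stand-ins, not the actual OOB code or constants.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical return codes; the real ORTE constants are not shown in this log. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_ADDRESSEE_UNKNOWN = -1 };

/* Toy stand-in for the OOB's table of known peers. */
typedef struct {
    const char *name;     /* peer name */
    const char *contact;  /* contact info; NULL until the peer reports in */
} sketch_peer_t;

/* Return the recorded contact info for `name`, or NULL if there is none. */
static const char *sketch_lookup(const sketch_peer_t *peers, size_t n,
                                 const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(peers[i].name, name) == 0) {
            return peers[i].contact;
        }
    }
    return NULL;
}

/* Fail immediately instead of blocking until the address can be resolved. */
static int sketch_send(const sketch_peer_t *peers, size_t n, const char *name)
{
    if (sketch_lookup(peers, n, name) == NULL) {
        return SKETCH_ERR_ADDRESSEE_UNKNOWN;
    }
    /* ... a real implementation would queue the message here ... */
    return SKETCH_SUCCESS;
}
```

With this shape, a caller that wants the old "barrier" behavior has to wait for the contact info explicitly rather than relying on the send to block.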
Finally, to aid in setting triggers to the right values, we introduce the "arith" API for the GPR. This function allows you to atomically change the value in a registry location (either divide, multiply, add, or subtract) by the provided operand. It is equivalent to first fetching the value using a "get", then modifying it, and then putting the result back into the registry via a "put".
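As a rough illustration of what "atomically change the value" means here, the sketch below does the get/modify/put as one compare-and-swap loop on a single C11 atomic. The `sketch_arith` name and signature are assumptions for this example and are not the GPR API; the real registry may well use a different mechanism (e.g., a lock).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef enum { SKETCH_ADD, SKETCH_SUB, SKETCH_MUL, SKETCH_DIV } sketch_arith_op_t;

/* Apply `op` with `operand` to *slot as one atomic step.
 * Returns 0 on success, -1 on a bad operation or division by zero.
 * Overflow is not checked in this sketch. */
static int sketch_arith(_Atomic long *slot, sketch_arith_op_t op, long operand)
{
    long old, updated;

    assert(slot != NULL);
    if (op == SKETCH_DIV && operand == 0) {
        return -1;
    }
    old = atomic_load(slot);                      /* the "get" */
    do {
        switch (op) {                             /* the "modify" */
        case SKETCH_ADD: updated = old + operand; break;
        case SKETCH_SUB: updated = old - operand; break;
        case SKETCH_MUL: updated = old * operand; break;
        case SKETCH_DIV: updated = old / operand; break;
        default:         return -1;
        }
        /* The "put": succeeds only if nobody changed the value in between;
         * on failure `old` is refreshed and the loop recomputes. */
    } while (!atomic_compare_exchange_weak(slot, &old, updated));
    return 0;
}
```

The point of doing it this way, rather than a separate get and put, is that two callers adjusting the same trigger value cannot overwrite each other's update.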
This commit was SVN r14711.
* Move ipv6comat.h code into opal_config_bottom.h and change into some
more intelligent testing of structures
* Change opal's if interface to use sockaddr instead of sockaddr_storage,
as the RFCs suggest we do
* Move the networking code in opal that isn't directly related to if
detection into net.h
* Add a quick function to get the port out of either a sockaddr_in
or sockaddr_in6, saving a bunch of code in the oob.
* Update TCP oob and btl with new interface
This commit was SVN r14679.
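A helper of the kind mentioned in r14679 (get the port from either a sockaddr_in or a sockaddr_in6) might look something like this. The name `sketch_get_port` and the choice to return host byte order are assumptions for the example; the actual opal function is not shown in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Return the port in host byte order, or -1 for an unsupported family. */
static int sketch_get_port(const struct sockaddr *addr)
{
    assert(addr != NULL);
    switch (addr->sa_family) {
    case AF_INET:
        return ntohs(((const struct sockaddr_in *)addr)->sin_port);
    case AF_INET6:
        return ntohs(((const struct sockaddr_in6 *)addr)->sin6_port);
    default:
        return -1;
    }
}
```

Callers pass a plain `struct sockaddr *` and no longer need their own IPv4/IPv6 #if blocks just to read a port.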
assumptions in the FT restart code for the ORTE layer.
This fixes those problems by having the RML completely shutdown and
restart the OOB framework (instead of just the module as before).
This makes it much easier to manage and maintain as the OOB
changes in the future.
The SDS now does communication as part of its startup procedure, so
we need to make sure we restart the RML before the SDS so that it can
communicate properly.
OOB base [close|open] used a static bool to determine if they have
been called previously or not. I needed to expose this boolean so
that I can close() then open() the oob base in the restart procedure.
The functionality has not changed, we just now have the ability to
open/close the framework as many times as we need to as long as we
always call them in that order. (As before, calling open twice in a row is not
allowed; a second open() is only allowed after a close().)
Things seem to be working now.
This commit was SVN r14515.
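The open/close rule from r14515 can be pictured with a small sketch. All names here are hypothetical, and returning an error on a repeated open is an assumption made for the example (the real base functions may handle that case differently).

```c
#include <assert.h>
#include <stdbool.h>

/* Exposed (no longer static) so restart code can see the framework state. */
bool sketch_oob_base_opened = false;

int sketch_oob_base_open(void)
{
    if (sketch_oob_base_opened) {
        return -1;              /* open() twice in a row is rejected */
    }
    /* ... open components here ... */
    sketch_oob_base_opened = true;
    return 0;
}

int sketch_oob_base_close(void)
{
    if (!sketch_oob_base_opened) {
        return -1;              /* nothing to close */
    }
    /* ... close components here ... */
    sketch_oob_base_opened = false;
    return 0;
}

/* Restart = close() followed by open(), always in that order. */
int sketch_oob_base_restart(void)
{
    int rc = sketch_oob_base_close();
    if (rc != 0) {
        return rc;
    }
    assert(!sketch_oob_base_opened);
    return sketch_oob_base_open();
}
```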
- make opal_sockaddr2str() take a sockaddr_storage instead of a sockaddr_in6
so that it works for IPv4 and IPv6 addresses, and remove a whole bunch
of #ifs in the OOB code.
- Fix a compiler warning in the TCP BTL due to run-time determined
array size by making it a dynamically allocated array.
- Fix the unpacking code of IPv4 addresses when using IPv6 support, so
that the address is in the correct location (instead of in an IPv6
structure, use an IPv4 structure). Refs trac:1005.
This commit was SVN r14514.
The following Trac tickets were found above:
Ticket 1005 --> https://svn.open-mpi.org/trac/ompi/ticket/1005
There is a binomial algorithm in the code (i.e., the HNP would send to a subset of the orteds, which then relay it on according to the typical log-2 algo), but that has a bug in it so the code won't let you select it even if you tried (and the mca param doesn't show, so you'd *really* have to try).
This also involved a slight change to the oob.xcast API, so propagated that as required.
Note: this has *only* been tested on rsh, SLURM, and Bproc environments (now that it has been transferred to the OMPI trunk, I'll need to re-test it [only done rsh so far]). It should work fine on any environment that uses the ORTE daemons - anywhere else, you are on your own... :-)
Also, correct a mistake where the orte_debug_flag was declared an int, but the mca param was set as a bool. Move the storage for that flag to the orte/runtime/params.c and orte/runtime/params.h files appropriately.
This commit was SVN r14475.
Per discussions with Brian and Ralph, make a slight correction in
where components are installed. Use $pkglibdir, not $libdir/openmpi,
so that when compiled in the orte trunk, components are installed to
the right directory (because the component search path is checking
$pkglibdir).
This commit was SVN r14345.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r14289
* Remove the connect() timeout code, as it had some nasty race conditions
when connections were established as the trigger was firing. A better
solution has been found for the cluster where this was needed, so just
removing it was easiest.
* When a fatal error (too many connection failures) occurs, set an error
on messages in the queue even if there isn't an active message. The
first message to any peer will be queued without being active (and
so will all subsequent messages until the connection is established),
and the orteds will hang until that first message completes. So if
an orted can never contact its peer, it will never exit and just sit
waiting for that message to complete.
* Cover an interesting RST condition in the connect code. A connection
can complete the three-way handshake, the connector can even send
some data, but the server side will drop the connection because it
can't move it from the half-connected to fully-connected state because
of space shortage in the listen backlog queue. This causes a RST to
be received the first time that recv() is called, which will be when waiting
for the remote side of the OOB ack. In this case, transition the
connection back into a CLOSED state and try to connect again.
* Add levels of debugging, rather than all or nothing, each building on
the previous level. 0 (default) is hard errors. 1 is connection
error debugging info. 2 is all connection info. 3 is more state
info. 4 includes all message info.
* Add some hopefully useful comments
This commit was SVN r14261.
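The RST case from r14261 boils down to one decision after the recv() that waits for the peer's ack. Below is a minimal sketch of that decision as a pure function. The state names, the retry counter, and treating any received data as a complete ack are simplifying assumptions, not the actual TCP OOB state machine.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

typedef enum {
    SKETCH_CLOSED,       /* not connected; caller should connect() again */
    SKETCH_CONNECT_ACK,  /* connected, still waiting for the peer's ack */
    SKETCH_CONNECTED,
    SKETCH_FAILED
} sketch_state_t;

/* Pick the next state from the result of recv() while waiting for the ack.
 * `rc` is recv()'s return value, `err` is errno captured right after it. */
static sketch_state_t sketch_after_ack_recv(ssize_t rc, int err,
                                            int *retries_left)
{
    assert(retries_left != NULL);
    if (rc > 0) {
        return SKETCH_CONNECTED;        /* simplified: ack fully received */
    }
    if (rc < 0 && (err == EAGAIN || err == EWOULDBLOCK || err == EINTR)) {
        return SKETCH_CONNECT_ACK;      /* nothing yet, keep waiting */
    }
    if (rc < 0 && err == ECONNRESET && *retries_left > 0) {
        /* Server dropped us (e.g. listen backlog full): start over. */
        --*retries_left;
        return SKETCH_CLOSED;
    }
    return SKETCH_FAILED;               /* EOF, other errors, or out of retries */
}
```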
OPAL_FREE_LIST_WAIT/RETURN will not use locks in a non-threaded build
conditionally use locks if non-threaded around the OPAL_FREE_LIST_WAIT/RETURN
seems to fix the issue
Tested at 4K processes and seems to work.
This commit was SVN r14135.
listen thread, but we're not the HNP. This is better than not starting up
any listen mode, which is what we were doing before :/
This commit was SVN r14133.
This merge adds Checkpoint/Restart support to Open MPI. The initial
frameworks and components support a LAM/MPI-like implementation.
This commit follows the risk assessment presented to the Open MPI core
development group on Feb. 22, 2007.
This commit closes trac:158
More details to follow.
This commit was SVN r14051.
The following SVN revisions from the original message are invalid or
inconsistent and therefore were not cross-referenced:
r13912
The following Trac tickets were found above:
Ticket 158 --> https://svn.open-mpi.org/trac/ompi/ticket/158
it's bigger than the timeout for the connect() call, just don't register
the handler by default and fall back to connect() timing out. Should give
much happier performance on big clusters.
This commit was SVN r13639.
the connect() timeout, so that we'll use that rather than our own timeout by
default. The timeout was set low for Big Red, but causes problems for very
large clusters, as there's no way to wire them up in 10 seconds most of the
time.
This commit was SVN r13062.
components that use configure.m4 for configuration or are always built.
The macro has not been needed since moving to configure types other than
configure.stub
Fixes trac:590
This commit was SVN r13031.
The following Trac tickets were found above:
Ticket 590 --> https://svn.open-mpi.org/trac/ompi/ticket/590
I know it's just a technicality, but it is time to address such things rather than just letting them continue to propagate. :-)
This commit was SVN r12954.
I found only two places that were looking at the tokens:
1. the odls - we used the tokens to separately process the globals container data from everything else. In this case, I left the subscription that returned the globals data alone, but "stripped" the subscription that returned the launch data for the procs. These subscriptions have nothing to do with the xcast message.
2. the pml_base_modex - the callback function was getting process names from the returned tokens. Actually, this function was doing a very bad thing - it was assuming that the first token returned was *always* the process name. This is currently true, but is one of those assumptions that someone could have easily changed - and suddenly found the system inexplicably failing. I modified the function to (a) get the name sent back to us, (b) "strip" the value structures of tokens and segment strings, and (c) correctly obtain process names from the returned values. I also reindented the heck out of the code so it was legible (at least, to my old eyes).
This commit was SVN r12813.
Obviously, people like bproc will have to get the app_num via another avenue...but that's a problem for another day. Several options are easily available.
This commit was SVN r12788.
1. implement and enable the non-described buffer operations. I will send out a more detailed explanation separately. However, this mode of operation (which is now the default) significantly reduces message size during startup. If you want the described buffers, set the mca param "-mca dss_describe_buffer 1".
2. revise the xcast system to support both linear and binomial tree broadcast methods. Since we are seeing scenarios where the binomial tree can cause problems, I have made the linear method the default. To run with the binomial tree, set the mca param "-mca oob_xcast_mode binomial".
3. add some detailed timing reports to the xcast operation. These are enabled via "-mca oob_xcast_timing 1".
4. add some more unit tests for the dss and gpr (focused on support for the non-described buffer)
This commit was SVN r12722.
because they are in ORTE, not OMPI. Also, remove the ORTE_PROCESS_NAME macros
in iof base as they are duplicates of the ones that were in ns_types, which
meant that bad things happened if you changed what an orte_process_name_t
looked like.
This commit was SVN r12646.
Accordingly, there are new APIs to the name service to support the ability to get a job's parent, root, immediate children, and all its descendants. In addition, the terminate_job, terminate_orted, and signal_job APIs for the PLS have been modified to accept attributes that define the extent of their actions. For example, doing a "terminate_job" with an attribute of ORTE_NS_INCLUDE_DESCENDANTS will terminate the given jobid AND all jobs that descended from it.
I have tested this capability on a MacBook under rsh, Odin under SLURM, and LANL's Flash (bproc). It worked successfully on non-MPI jobs (both simple and including a spawn), and MPI jobs (again, both simple and with a spawn).
This commit was SVN r12597.
1. Added reporting points around the xcasts in MPI_Init. Note that these times will include time spent waiting for a trigger to fire, which is why the times between stage gates did NOT include these times initially. The inter-stage-gate times still do NOT include the xcast time - the xcast time is reported separately.
2. Added the process vpid on the MPI_Init timing reports for clarity.
3. Added a report from the xcast function on the HNP that outputs the number of bytes in the message being sent to the processes.
This commit was SVN r12422.
packing a sockaddr_in, as there are some endianness and padding issues
with sending a sockaddr_in. Note that the sin_port and sin_addr are
already in network byte order, which is why we pack them as a byte
string.
Refs trac:493
This commit was SVN r12301.
The following Trac tickets were found above:
Ticket 493 --> https://svn.open-mpi.org/trac/ompi/ticket/493
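To illustrate the r12301 point about sending only the port and address as raw bytes: sin_port and sin_addr are copied verbatim because they are already in network byte order, and the rest of the struct (family, padding) never goes on the wire. The function names and the 6-byte layout below are assumptions for this sketch, not the actual packing code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Wire size: 2-byte port followed by 4-byte IPv4 address. */
#define SKETCH_ADDR_WIRE_LEN (sizeof(in_port_t) + sizeof(struct in_addr))

/* Copy the port and address out as a byte string, with no byte swapping. */
static void sketch_pack_addr(const struct sockaddr_in *in, unsigned char *out)
{
    assert(in != NULL && out != NULL);
    memcpy(out, &in->sin_port, sizeof(in->sin_port));
    memcpy(out + sizeof(in->sin_port), &in->sin_addr, sizeof(in->sin_addr));
}

/* Rebuild a sockaddr_in locally; family and padding are set here, not sent. */
static void sketch_unpack_addr(const unsigned char *in, struct sockaddr_in *out)
{
    assert(in != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    memcpy(&out->sin_port, in, sizeof(out->sin_port));
    memcpy(&out->sin_addr, in + sizeof(out->sin_port), sizeof(out->sin_addr));
}
```

Because nothing is swapped on either side, the same six bytes mean the same thing on big- and little-endian hosts.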
seed value have something set to true. Allow selection of the listen
type to thread if (and only if) the process is the HNP...
This commit was SVN r12105.
__DARWIN_ALIGN_POWER define from the last release of the OS X compiler
toolchain. The bug in net/if.h, however, is still there. So look
for the hints that we're on a 64 bit Apple PowerPC instead.
* If we don't find a buffer size that works by 10MB, we're never
going to. So add some code to limit the buffer size we'll try
so that we don't fall into an infinite loop
* Detect errors in opal_ifcount in the oob init code
Refs trac:420
This commit was SVN r11825.
The following Trac tickets were found above:
Ticket 420 --> https://svn.open-mpi.org/trac/ompi/ticket/420
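The 10MB cap from r11825 is essentially a bounded grow-and-retry loop. A minimal sketch follows, with the actual interface query replaced by a hypothetical probe callback; the doubling strategy and all `sketch_*` names are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* If no size up to 10MB works, give up rather than loop forever. */
#define SKETCH_MAX_BUF ((size_t)10 * 1024 * 1024)

/* Hypothetical probe: returns nonzero when `bufsize` bytes were enough. */
typedef int (*sketch_probe_fn)(size_t bufsize);

/* Double the buffer size until the probe succeeds or the cap is exceeded.
 * Returns the working size, or 0 if none was found under the cap. */
static size_t sketch_find_bufsize(size_t start, sketch_probe_fn probe)
{
    assert(start > 0 && probe != NULL);
    for (size_t size = start; size <= SKETCH_MAX_BUF; size *= 2) {
        if (probe(size)) {
            return size;
        }
    }
    return 0;
}

/* Two toy probes used only to exercise the loop. */
static int sketch_probe_needs_4k(size_t bufsize) { return bufsize >= 4096; }
static int sketch_probe_never(size_t bufsize) { (void)bufsize; return 0; }
```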
We were still waiting the entire duration of the timeout before we figured out that a connect() was successful. Re-introduce adding the peer_send_event so that we detect immediately when a connect() completes.
Also make sure to delete the timeout event in complete_connect().
Fixed a struct timeval initialization warning reported by Jeff.
Remove an erroneous opal_output().
This commit was SVN r11724.
The following SVN revision numbers were found above:
r11718 --> open-mpi/ompi@1b6231a9b5
Each 's' partition has its own TCP network. It's fine to use this network for jobs that fit inside the partition, but the TCP OOB errors when trying to connect across two partitions, because there are two disjoint networks. Each node also has another TCP network connecting ALL nodes together.
So the solution is to actually try all the available TCP interfaces on a node, instead of erroring when the first one fails.
Also, the default TCP connect() timeout is way too long (5 minutes) - use our own timeout mechanism, with the timeout value expressed as an MCA parameter.
This commit was SVN r11718.
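The two ideas in r11718 (try every address instead of erroring on the first failure, and use our own connect() timeout) can be sketched with a non-blocking connect plus poll(). This is an illustration under assumptions: the function names are made up, the timeout is a plain argument rather than an MCA parameter, and the real OOB drives this from its event loop rather than blocking in poll().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try one address, giving up after timeout_ms. Returns a connected
 * (still non-blocking) fd, or -1 on failure or timeout. */
static int sketch_connect_timeout(const struct sockaddr *addr, socklen_t len,
                                  int timeout_ms)
{
    int fd, flags;

    assert(addr != NULL);
    fd = socket(addr->sa_family, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    if (connect(fd, addr, len) == 0) {
        return fd;                              /* connected immediately */
    }
    if (errno == EINPROGRESS) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        int err = 0;
        socklen_t elen = sizeof(err);
        /* Writable means the connect finished; SO_ERROR says how. */
        if (poll(&pfd, 1, timeout_ms) == 1 &&
            getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &elen) == 0 &&
            err == 0) {
            return fd;
        }
    }
    close(fd);
    return -1;
}

/* Walk all candidate addresses; only fail if every one of them fails. */
static int sketch_connect_any(const struct sockaddr *const *addrs,
                              const socklen_t *lens, size_t n, int timeout_ms)
{
    for (size_t i = 0; i < n; ++i) {
        int fd = sketch_connect_timeout(addrs[i], lens[i], timeout_ms);
        if (fd >= 0) {
            return fd;
        }
    }
    return -1;
}
```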
- use the OPAL functions for PATH and environment variables
- make all headers C++ friendly
- no unnamed structures
- no implicit cast.
Plus a full implementation for the orte_wait functions.
This commit was SVN r11347.
different macros, one for each project. Therefore, now we have OPAL_DECLSPEC,
ORTE_DECLSPEC and OMPI_DECLSPEC. Please use them based on the sub-project.
This commit was SVN r11270.
Other changes:
1. Remove the old xcpu components as they are not functional.
2. Fix a "bug" in orterun whereby we called dump_aborted_procs even when we normally terminated. There is still some kind of bug in this procedure, however, as we appear to be calling the orterun job_state_callback function every time a process terminates (instead of only once when they have all terminated). I'll continue digging into that one.
This will require an autogen/configure, I'm afraid.
This commit was SVN r11228.
Clean up the remainder of the size_t references in the runtime itself. Convert to orte_std_cntr_t wherever it makes sense (only avoid those places where the actual memory size is referenced).
Remove the obsolete oob barrier function (we actually obsoleted it a long time ago - just never bothered to clean it up).
I have done my best to go through all the components and catch everything, even if I couldn't test compile them since I wasn't on that type of system. Still, I cannot guarantee that problems won't show up when you test this on specific systems. Usually, these will just show as "warning: comparison between signed and unsigned" notes which are easily fixed (just change a size_t to orte_std_cntr_t).
In some places, people didn't use size_t, but instead used some other variant (e.g., I found several places with uint32_t). I tried to catch all of them, but...
Once we get all the instances caught and fixed, this should once and for all resolve many of the heterogeneity problems.
This commit was SVN r11204.
r10841, so revert it (and its fixes) out. Will bring back once cleaned up from
the code used in the tbird experiment
This commit was SVN r10991.
The following SVN revision numbers were found above:
r10841 --> open-mpi/ompi@dfa1221c3b
handler before the write() and de-register it afterwards. Determine
if the write() succeeded or failed by the return of write().
This commit was SVN r10858.
than $(LN_S). This causes problems with Windows and probably
elsewhere (re: #200). So use a slightly different trick to get the
right header selected for the MEMCPY and TIMER components.
* Using the same trick used to solve the AC_CONFIG_LINKS problem,
stop using a separate header file for direct calling in the
PML and MTL. This lets me remove some icky code in ompi_mca.m4
that was more fragile than I really liked.
This commit was SVN r10841.
Since Jeff and I are going to a branch for T-bird, we have restored the trunk to its prior state to avoid any possibility of disturbing it.
This commit was SVN r10774.
Please report any abnormal behavior during launch, though, as we would like to understand what (if any) impact is seen. I couldn't see any on small jobs (the modulo functions render this number down pretty low).
This commit was SVN r10763.
copy of the receive buffer based on the iovec struct that may have been updated
during partial reads to reflect the current offset. Need to make the copy using
the base address of the buffer.
Thanks to Sven Stork for finding this.
This should be backported to 1.0.X and 1.1.X branches.
This commit was SVN r9749.
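The r9749 fix is about which pointer to copy from after partial reads. A small sketch of the idea, using a hypothetical receive-buffer struct (the real OOB message structure is not shown here): the iovec is advanced as data arrives, so the buffer's base address has to be kept separately and used for the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

typedef struct {
    char *base;        /* start of the receive buffer; never moves */
    size_t len;        /* total buffer length */
    struct iovec iov;  /* current read position; advanced by partial reads */
} sketch_rbuf_t;

static void sketch_rbuf_init(sketch_rbuf_t *b, char *storage, size_t len)
{
    assert(b != NULL && storage != NULL);
    b->base = storage;
    b->len = len;
    b->iov.iov_base = storage;
    b->iov.iov_len = len;
}

/* Record that `nread` more bytes arrived (as after a partial readv()). */
static void sketch_rbuf_advance(sketch_rbuf_t *b, size_t nread)
{
    assert(nread <= b->iov.iov_len);
    b->iov.iov_base = (char *)b->iov.iov_base + nread;
    b->iov.iov_len -= nread;
}

/* Copy out what has been received so far. Copy from `base`, not from
 * iov_base, which now points past the data already read. */
static size_t sketch_rbuf_copy_out(const sketch_rbuf_t *b, char *dst,
                                   size_t dst_len)
{
    size_t filled = b->len - b->iov.iov_len;
    size_t n = filled < dst_len ? filled : dst_len;
    memcpy(dst, b->base, n);
    return n;
}
```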
thread, which will do progress independently of MPI. So in this case we
have to call opal_event_loop instead of opal_progress.
This commit was SVN r9551.
event library (since the event library has its own thread). So when
we are using progress threads, we really want to call opal_event_loop()
and not opal_progress().
This commit was SVN r9549.
- moved hton64 and ntoh64 from the bunch of places it had been copied
into one header file
- properly set and use the btl_tcp's nbo option to put things in
network byte order on the wire if both sides don't have the same
endianness
- Put the OB1 PML's headers (with a couple exceptions I need to discuss
with Tim) in network byte order on the wire if both sides don't have
the same endianness
- since it was needed for the TCP BTL, move the orte_process_name_t
HTON and NTOH macros from the TCP OOB to ns_types.h
This commit was SVN r9145.
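For reference, a 64-bit host-to-network conversion of the kind consolidated in r9145 can be built from htonl(). This is a generic sketch with made-up `sketch_` names, not the actual hton64/ntoh64 from the shared header.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

static_assert(sizeof(uint64_t) == 2 * sizeof(uint32_t),
              "sketch splits a 64-bit value into two 32-bit halves");

/* Convert a 64-bit value from host to network (big-endian) byte order. */
static uint64_t sketch_hton64(uint64_t host)
{
    const uint32_t hi = (uint32_t)(host >> 32);
    const uint32_t lo = (uint32_t)(host & 0xffffffffu);

    if (htonl(1) == 1) {
        return host;                    /* big-endian host: nothing to do */
    }
    /* Little-endian host: swap each half and exchange the halves. */
    return ((uint64_t)htonl(lo) << 32) | htonl(hi);
}

/* The conversion is its own inverse. */
static uint64_t sketch_ntoh64(uint64_t net)
{
    return sketch_hton64(net);
}
```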
- move files out of toplevel include/ and etc/, moving it into the
sub-projects
- rather than including config headers with <project>/include,
have them as <project>
- require all headers to be included with a project prefix, with
the exception of the config headers ({opal,orte,ompi}_config.h
mpi.h, and mpif.h)
This commit was SVN r8985.
intended to include the OMPI_DEBUG_ZERO call).
These debugging statements should not have affected correctness
because the value of 78 will be overridden in the read() and the
assert()/abort() stuff will only be triggered on an error which should
never happen (i.e., the error should have been handled by the prior if
conditional). But still, this code should not be there.
This commit was SVN r8649.
The following SVN revision numbers were found above:
r8643 --> open-mpi/ompi@a6b869ed68
- Need to make sure that SIZE_MAX exists as a constant if stdint.h
doesn't exist
- struct timeval is defined in unistd.h on IRIX, so need to include
that header file wherever struct timeval is used.
This commit was SVN r8361.
* turns out (duh!) that there was a reason that the <projectdir>dir
variable was set in the AM conditional. If not, stupid directories
are created and not needed... duh.
This commit was SVN r8205.
component/base Makefile.am files, reducing the time configure spends
stamping out Makefiles at the end
* Install base_impl.h file when devel-headers are being installed
This commit was SVN r8200.
originally suggested by Ralf Wildenhues, to try to speed autogen, configure,
and make (and possibly even make install). Use automake's include directive
to drastically reduce the number of Makefile files (although the number of
Makefile.am files is the same - most are just included in a top-level
Makefile.am). Also use an Automake SUBDIRs feature to eliminate the
dynamic-mca tree, which was no longer really needed. This makes adding
a framework easier (since you don't have to remember the dynamic-mca
tree) and makes building faster (as make doesn't have to recurse through
the dynamic-mca tree)
This commit was SVN r7777.