http://www.open-mpi.org/community/lists/devel/2010/07/8240.php
Documentation:
http://osl.iu.edu/research/ft/
Major Changes:
--------------
* Added C/R-enabled Debugging support.
Enabled with the --enable-crdebug flag. See the following website for more information:
http://osl.iu.edu/research/ft/crdebug/
* Added Stable Storage (SStore) framework for checkpoint storage
* 'central' component does a direct to central storage save
* 'stage' component stages checkpoints to central storage while the application continues execution.
* 'stage' supports offline compression of checkpoints before moving (sstore_stage_compress)
* 'stage' supports local caching of checkpoints to improve automatic recovery (sstore_stage_caching)
* Added Compression (compress) framework to support
* Add two new ErrMgr recovery policies
* {{{crmig}}} C/R Process Migration
* {{{autor}}} C/R Automatic Recovery
* Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} ErrMgr component
* Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} configure option)
* {{{OMPI_CR_Checkpoint}}} (Fixes trac:2342)
* {{{OMPI_CR_Restart}}}
* {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
* {{{OMPI_CR_INC_register_callback}}} (Fixes trac:2192)
* {{{OMPI_CR_Quiesce_start}}}
* {{{OMPI_CR_Quiesce_checkpoint}}}
* {{{OMPI_CR_Quiesce_end}}}
* {{{OMPI_CR_self_register_checkpoint_callback}}}
* {{{OMPI_CR_self_register_restart_callback}}}
* {{{OMPI_CR_self_register_continue_callback}}}
* The ErrMgr predicted_fault() interface has been changed to take an opal_list_t of ErrMgr defined types. This will allow us to better support a wider range of fault prediction services in the future.
* Add a progress meter to:
* FileM rsh (filem_rsh_process_meter)
* SnapC full (snapc_full_progress_meter)
* SStore stage (sstore_stage_progress_meter)
* Added 2 new command line options to ompi-restart
* --showme : Display the full command line that would have been exec'ed.
* --mpirun_opts : Command line options to pass directly to mpirun. (Fixes trac:2413)
* Deprecated some MCA params:
* crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
* snapc_base_global_snapshot_dir deprecated, use sstore_base_global_snapshot_dir
* snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
* snapc_base_store_in_place deprecated, replaced with different components of SStore
* snapc_base_global_snapshot_ref deprecated, use sstore_base_global_snapshot_ref
* snapc_base_establish_global_snapshot_dir deprecated, never well supported
* snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
Minor Changes:
--------------
* Fixes trac:1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint handles and does the right thing.
* Fixes trac:2097 : {{{ompi-info}}} should now report all available CRS components
* Fixes trac:2161 : Manual checkpoint movement. A user can 'mv' a checkpoint directory from the original location to another and still restart from it.
* Fixes trac:2208 : Honor various TMPDIR varaibles instead of forcing {{{/tmp}}}
* Move {{{ompi_cr_continue_like_restart}}} to {{{orte_cr_continue_like_restart}}} to be more flexible in where this should be set.
* opal_crs_base_metadata_write* functions have been moved to SStore to support a wider range of metadata handling functionality.
* Cleanup the CRS framework and components to work with the SStore framework.
* Cleanup the SnapC framework and components to work with the SStore framework (cleans up these code paths considerably).
* Add 'quiesce' hook to CRCP for a future enhancement.
* We now require a BLCR version that supports {{{cr_request_file()}}} or {{{cr_request_checkpoint()}}} in order to make the code more maintainable. Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer to use {{{cr_request_checkpoint()}}}.
* Add optional application level INC callbacks (registered through the CR MPI Ext interface).
* Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds to make the C/R thread less aggressive.
* {{{opal-restart}}} now looks for cache directories before falling back on stable storage when asked.
* {{{opal-restart}}} also support local decompression before restarting
* {{{orte-checkpoint}}} now uses the SStore framework to work with the metadata
* {{{orte-restart}}} now uses the SStore framework to work with the metadata
* Remove the {{{orte-restart}}} preload option. This was removed since the user only needs to select the 'stage' component in order to support this functionality.
* Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no longer hard codes {{{-am ft-enable-cr}}}.
* Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 'fixed' the problem, then it should be skipped.
* Make sure to decrement the number of 'num_local_procs' in the orted when one goes away.
* odls now checks the SStore framework to see if it needs to load any checkpoint files before launching (to support 'stage'). This separates the SStore logic from the --preload-[binary|files] options.
* Add unique IDs to the named pipes established between the orted and the app in SnapC. This is to better support migration and automatic recovery activities.
* Improve the checks for 'already checkpointing' error path.
* A a recovery output timer, to show how long it takes to restart a job
* Do a better job of cleaning up the old session directory on restart.
* Add a local module to the autor and crmig ErrMgr components. These small modules prevent the 'orted' component from attempting a local recovery (Which does not work for MPI apps at the moment)
* Add a fix for bounding the checkpointable region between MPI_Init and MPI_Finalize.
This commit was SVN r23587.
The following Trac tickets were found above:
Ticket 1924 --> https://svn.open-mpi.org/trac/ompi/ticket/1924
Ticket 2097 --> https://svn.open-mpi.org/trac/ompi/ticket/2097
Ticket 2161 --> https://svn.open-mpi.org/trac/ompi/ticket/2161
Ticket 2192 --> https://svn.open-mpi.org/trac/ompi/ticket/2192
Ticket 2208 --> https://svn.open-mpi.org/trac/ompi/ticket/2208
Ticket 2342 --> https://svn.open-mpi.org/trac/ompi/ticket/2342
Ticket 2413 --> https://svn.open-mpi.org/trac/ompi/ticket/2413
status structure or the _ucount field in the status structure.
On 64-bit sparc, the macros resolve into integer array assignments.
For all others, they are just simple assignments. This fixes
possible BUS errors seen when running on the SPARC processor.
This bug was introduced when the _count field changed from an int
into a size_t. See the changes to request.h for additional details.
This commit fixes trac:2514.
This commit was SVN r23554.
The following Trac tickets were found above:
Ticket 2514 --> https://svn.open-mpi.org/trac/ompi/ticket/2514
step is the configure and Fortran mojo that Jeff will put in. Until then I
guess the Fortran interface is broken (at least all functions using the hidden
count firld in the MPI_Status).
This commit was SVN r23467.
operations, not ints.
Sorry for the mid-day configure.ac change, folks...
This commit was SVN r23449.
The following Trac tickets were found above:
Ticket 2472 --> https://svn.open-mpi.org/trac/ompi/ticket/2472
Create an ability to store the contact info for multiple HNPs being used to route between different job families. Modify the dpm orte module to pass the resulting store during the connect_accept procedure so that all jobs involved in the resulting communicator know how to route OOB messages between them.
Add a test provided by Philippe that tests this ability.
This commit was SVN r23438.
case individual entries aren't used, but dynamic rules are enabled
(i.e., at least one or more of them are not NULL, meaning that they'll
all be assumed to be either NULL or a valid value).
This commit was SVN r23361.
#define CACHE_LINE_SIZE to 128. This name has a conflict on NetBSD,
and it seems kinda odd to have a header file that ''only'' defines a
single value. Also, we'll soon be raising hwloc to be a first-class
item, so having this file around seemed kinda weird.
Therefore, I replaced CACHE_LINE_SIZE with opal_cache_line_size, an
int (in opal/runtime/opal_init.c and opal/runtime/opal.h) on the
rationale that we can fill this in at runtime with hwloc info (trunk
and v1.5/beyond, only). The only place we ''needed'' a compile-time
CACHE_LINE_SIZE was in the BTL SM (for struct padding), so I made a
new BTL_SM_ preprocessor macro with the old CACHE_LINE_SIZE value
(128). That use isn't suitable for run-time hwloc information,
anyway.
This commit was SVN r23349.
Configure Option:
--enable-sysv
MCA Parameter:
mpi_common_sm
mpi_common_sm accepts a comma delimited list of: [sysv],mmap (order
dependent). The first component that is successfully selected is used. For
example, -mca mpi_common_sm sysv,mmap will first try sysv. If sysv is not
successfully selected, then mmap will be used. mmap will be used if
mpi_common_sm is not provided.
Notes:
Please make certain that your system's shmmax limit, or equivalent, is larger
than mpool_sm_min_size. Otherwise, shmget may fail.
This commit was SVN r23260.
At this point, it is just cleared (and ignored) so default behavior has not changed.
However, future failover support can take advantage of this flag.
Reviewed by Pasha Shamis.
This commit was SVN r23204.
make sure that we do not call coll_gather and coll_bcast in the very same
instances, since some collective (intra) modules do not seem to like the fact
if they are called for scount or rcount being zero (for regular
intra-communicator operations, this is handled on the MPI API layer).
Fixes trac:2405
This commit was SVN r23188.
The following Trac tickets were found above:
Ticket 2405 --> https://svn.open-mpi.org/trac/ompi/ticket/2405
allows the BTL to specify a specific ompi_proc_t that had an
error. Also add an optional descriptive string. Currently, arguments
are not used but will be by future failover PML.
Changes based on RFC. Reviewed by George Bosilca.
This commit was SVN r23174.
The fix is to just check if the return value is positive or not, since all the SOS encoded errors are *always* negative.
The real fix (as Ralph points out) is to change these functions (opal_pointer_array_add and mca_base_param*) to return the index as a pointer.
This commit was SVN r23173.
(OMPI_ERR_* = OPAL_SOS_GET_ERR_CODE(ret)), since the return value could be a
SOS-encoded error. The OPAL_SOS_GET_ERR_CODE() takes in a SOS error and returns
back the native error code.
* Since OPAL_SUCCESS is preserved by SOS, also change all calls of the form
(OPAL_ERROR == ret) to (OPAL_SUCCESS != ret). We thus avoid having to
decode 'ret' to get the native error code.
This commit was SVN r23162.
http://marc.info/?l=linux-mm-commits&m=127352503417787&w=2 for more
details.
* Remove the ptmalloc memory component; replace it with a new "linux"
memory component.
* The linux memory component will conditionally compile in support
for ummunotify. At run-time, if it has ummunotify support and
finds run-time support for ummunotify (i.e., /dev/ummunotify), it
uses it. If not, it tries to use ptmalloc via the glibc memory
hooks.
* Add some more API functions to the memory framework to accomodate
the ummunotify model (i.e., poll to see if memory has "changed").
* Add appropriate calls in the rcache to the new memory APIs to see
if memory has changed, and to react accordingly.
* Add a few comments in the openib BTL to indicate why we don't need
to notify the OPAL memory framework about specific instances of
registered memory.
* Add dummy API calls in the solaris malloc component (since it
doesn't have polling/"did memory change" support).
This commit was SVN r23113.
ompi_ptr_t by translating into void*. Instead keep it as an ompi_ptr_t all
the way. Thanks to Timur Magomedov for helping to track down this issue and
test the patch.
cmr:v1.4
cmr:v1.5
This commit was SVN r23030.
1. fix a bug that caused an infinite loop in configure when specifying want-ft but not want-ft-thread by removing a stale reference to the opal-progress-thread option
2. add want-ft=orcm so we can build the orcm errmgr component
3. cleanup the use of "ompi_want_ft_xxx" and replace it with "opal_want_ft_xxx" so that naming conventions are preserved
This commit was SVN r22885.
Remove the --enable-progress-threads option as this is no longer functional, and hardcode OPAL_ENABLE_PROGRESS_THREADS to 0.
Replace the --enable-mpi-threads option with --enable-mpi-thread-multiple as this is clearer as to meaning. This option automatically turns "on" opal thread support if it wasn't already so specified. If the user specifies --disable-opal-multi-threads --enable-mpi-thread-multiple, we will error out with a message
Add a new --enable-opal-multi-threads option that turns "on" opal thread support without doing anything wrt mpi-thread-multiple
This commit was SVN r22841.
Many of the OPAL_ENABLE_FT should be OPAL_ENABLE_FT_CR, so fix those.
The OPAL Layer INC should call opal_output on restart so that it can refresh the string it prints to reflect the current pid/hostname which may have changed.
This commit was SVN r22824.
Short version: there is a bug in OS X/Snow Leopard, but there is also
a bug in Open MPI. Fixing the bug in Open MPI is both trivial (a
1-line change) and avoids the bug in OS X. We'll file an OS X bug
report upstream with Apple, but it should no longer affect us here in
OMPI.
Fixes trac:2039.
More details:
Some background first:
1. IPv4 sockets can only accept incoming IPv4 connections. However,
IPv6 sockets can be configured to accept ''only'' incoming IPv6
connection, or ''both'' incoming IPv4 and IPv6 connections. An
IPv6 socket attribute sets which listening behavior is used.
1. IPv4 and IPv6 have different port namespaces. Hence, it is
permissable to bind a v4 socket to port X ''and'' also bind a v6
socket to that same port X on the same interface (assuming that
the v6 socket is only accepting incoming v6 connections).
Incoming v4 connections to port X on the interface should get
matched to the listening v4 socket; incoming v6 connections should
get matched to the listening v6 socket.
1. When v6 sockets accept ''both'' incoming v4 and v6 connections, it
should claim port X in both namespaces.
1. Linux's default behavior is to only allow one listening socket to
be bound to a given port (i.e., ''either'' a v6 or v4 socket to be
bound to a single port X -- not both). A v6 socket can listen for
both v4 and v6 incoming connections on that port, but still --
only one socket will be bound to that port.
1. Snow Leopard's default behavior is to share ports -- i.e., let
both a v4 and a v6 listening socket to be bound to port X
(assuming that the v6 socket is only accepting incoming v6
connections).
The TCP BTL creates a listening socket for each address family.
Hence, it creates a v4 listening socket on INADDR_ANY and a v6
listening socket on the v6 equivalent of INADDR_ANY. OMPI then
iteratively tries to find ports to listen on within the range of
[mca_btl_tcp_port_min, mca_btl_tcp_port_min + mca_btl_tcp_port_range).
On Linux, the v4 socket will be bound to port X and the v6 socket will
likely be bound to port Y (where X != Y). On Snow Leopard, the v4
socket will be bound to port X and the v6 socket may ''also'' be bound
to port X. Since the namespaces are separate, this shouldn't be a
problem.
However, Open MPI was accidentally setting the v6 listening behavior
to accept ''both'' v4 and v6 incoming connections. This is a trivial
thing to fix -- change a 0 to a 1 in the code. On Linux, this issue
didn't matter because the v4 and v6 sockets were on different ports.
So even though the v6 socket ''would'' have accepted incoming v4
connections, that never happened because OMPI would direct v4
connections to the v4 port.
But on Snow Leopard, the v4 and v6 listening ports could end up
sharing the same port number. As mentioned above, this ''shouldn't''
have been a problem, but it looks like Snow Leopard has the following
bugs:
* If a v4 socket is already bound to port X, we're pretty sure that a
v6 socket with the "accept both v4 and v6 incoming connections"
listening behavior should not be able to claim port X (because
there's already a v4 socket listening on X). However, Snow Leopard
would allow binding a v4 socket to port X, and then allow a v6
socket configured to allow incoming v4 and v6 connections to
''also'' be bound to port X.
* After binding the v6 socket to port X, Snow Leopard then lets
''another'' v4 socket ''also'' get bound to port X. Hence, there's
now '''three''' sockets all listening on port X.
This obviously led to mis-matched TCP connections, and things went
downhill from there.
That being said, Snow Leopard doesn't exhibit this behavior if v6
sockets only allow incoming v6 connections. And technically, that is
exactly the behavior we want (we want v6 sockets to only accept
incoming v6 connections). So if we just change the flag to make our
v6 listening socket us this behavior, the problem on OS X goes away.
That's what this commit does -- it changes a 0 to a 1, indicating
"only let this v6 socket allow incoming v6 connections."
That was simple, wasn't it?
This commit was SVN r22788.
The following Trac tickets were found above:
Ticket 2039 --> https://svn.open-mpi.org/trac/ompi/ticket/2039
1. The code that looks at btl_tcp_if_exclude before doing a
modex_send uses strcmp rather than strncmp. That means that
"lo0" gets sent even though "lo" is excluded.
2. The code that determines whether a particular local TCP
interface can connect to a particular remote interface doesn't
check for loopback interfaces. With this fix, users can now
enable "lo" and be assured that it will only be used for intra-
node communication.
This commit was SVN r22762.
btl_openib_ip.*. The routines in these files are not specific to
iwarp -- they are specific to IP interfaces used with IBV devices
(even IB or IBoE/RoCEE/whatever devices).
This commit was SVN r22718.
issues with iwarp.c. These fixes are needed for IBoE / ROCEE /
whateveritscalledtoday. I added a few minor changes to his base
patch.
This commit was SVN r22717.
mca_osc_rdma_component.c_modules in ompi_osc_rdma_windx_to_module
Fixes case where there is unprotected access to
mca_osc_rdma_component.c_modules in ompi_osc_rdma_windx_to_module
This commit was SVN r22700.
INTERNAL to EXTRA_RETAIN, because not all "internal" communicators
have this flag set (only internal communicators with CIDs less than
their parent). Hence, what this flag ''really'' means is that there
was an extra RETAIN performed on it. So name the flag just that --
EXTRA_RETAIN -- indicating that an extra RETAIN has occurred.
This commit was SVN r22690.
The following SVN revision numbers were found above:
r22671 --> open-mpi/ompi@61dee816db
can occur ( fixes trac:2111 ).
Should not deregister memory with the rcache lock held otherwise a deadlock can occur as the lower
level infiniband libraries can free memory ( fixes trac:2110 )
cmr:v1.4
This commit was SVN r22683.
The following Trac tickets were found above:
Ticket 2110 --> https://svn.open-mpi.org/trac/ompi/ticket/2110
Ticket 2111 --> https://svn.open-mpi.org/trac/ompi/ticket/2111
as this can result in a low level free of memory which
can require the rcache lock resulting in a deadlock
This fixes trac:2107
cmr:v1.4
This commit was SVN r22679.
The following Trac tickets were found above:
Ticket 2107 --> https://svn.open-mpi.org/trac/ompi/ticket/2107
when protecting the no_wqe_pending_frags list.
fixes trac:2118 add cmr:v1.4
This commit was SVN r22678.
The following Trac tickets were found above:
Ticket 2118 --> https://svn.open-mpi.org/trac/ompi/ticket/2118
Also includes some minor copytight header additions that were missed in previous checkins
fixes trac:2101 added cmr:v1.4
This commit was SVN r22676.
The following Trac tickets were found above:
Ticket 2101 --> https://svn.open-mpi.org/trac/ompi/ticket/2101
communicator that we created has a lower CID than the parent comm. This can
happen when using the hierarch collective communication module or for
inter-communicators (since we make a duplicate of the original communicator).
This is not a problem as long as the user calls MPI_Comm_free on the parent
communicator. However, if the communicators are not freed by the user but
released by Open MPI in MPI_Finalize, we walk through the list of still
available communicators and free them one by one. Thus, local_comm is freed
before the actual inter-communicator. However, the local_comm pointer in the
inter communicator will still contain the 'previous' address of the local_comm
and thus this will lead to a segmentation violation. In order to prevent that
from happening, we increase the reference counter local_comm by one if its CID
is lower than the parent. We cannot increase however its reference counter if
the CID of local_comm is larger than the CID of the inter communicators, since
a regular MPI_Comm_free would leave in that the case the local_comm hanging
around and thus we would not recycle CID's properly, which was the reason and
the cause for this trouble.
This commit fixes tickets 2094 and 2166. Note however, that I want to close
them manually, since a slightly different patch is required for the 1.4
series. This commit will have to be applied for the 1.5 series. And I will
need a volunteer to review it.
This commit was SVN r22671.