openmpi/opal/runtime
Josh Hursey 5406fdfb80 Add support for sending SIGSTOP to the MPI job after the checkpoint is taken (uses a BLCR feature for the option).
This commit looks larger than it really is since it includes a fair amount of code cleanup.

The SIGSTOP/SIGCONT+checkpointing work uses some of the functionality in r20391. Basic use case below (note that the generated checkpoint is usable as usual even if the stopped application is terminated).
{{{
shell 1) mpirun -np 2 -am ft-enable-cr my-app
... running ...

shell 2) ompi-checkpoint --stop -v MPIRUN_PID
[localhost:001300] [  0.00 /   0.20]                 Requested - ...
[localhost:001300] [  0.00 /   0.20]                   Pending - ...
[localhost:001300] [  0.01 /   0.21]                   Running - ...
[localhost:001300] [  1.01 /   1.22]                   Stopped - ompi_global_snapshot_1234.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt

shell 2) killall -CONT mpirun

... application continues execution in shell 1 ...
}}}
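
The stop-and-resume behaviour in the transcript above relies on ordinary POSIX job-control signals. The following is a minimal sketch of that mechanism only, not the Open MPI or BLCR implementation: a child process stops itself (standing in for the job being stopped once its checkpoint is taken) and the parent resumes it with SIGCONT, much as `killall -CONT mpirun` does. The function name `sketch_stop_then_continue` is hypothetical.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of Open MPI.
 * Returns 0 if the child was observed stopped, then resumed by SIGCONT
 * and exited cleanly; returns -1 otherwise. */
static int sketch_stop_then_continue(void)
{
    pid_t child = fork();
    if (child < 0) {
        return -1;
    }
    if (child == 0) {
        /* Stand-in for "checkpoint taken": the child stops itself. */
        raise(SIGSTOP);
        /* Reached only after the parent delivers SIGCONT. */
        _exit(0);
    }

    int status = 0;
    /* WUNTRACED makes waitpid() report a stopped (not only exited) child. */
    if (waitpid(child, &status, WUNTRACED) < 0 || !WIFSTOPPED(status)) {
        return -1;
    }
    /* Equivalent in spirit to `kill -CONT <pid>` from a second shell. */
    if (kill(child, SIGCONT) < 0) {
        return -1;
    }
    if (waitpid(child, &status, 0) < 0) {
        return -1;
    }
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

While the child is stopped it uses no CPU time but still exists, so it could just as well be killed at that point instead of resumed.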

Other items in this commit are mostly cleanup that has been sitting off-trunk for too long:
 * Add a new {{{opal_crs_base_ckpt_options_t}}} type that encapsulates the various options that can be passed to the CRS. Currently only TERM and STOP are supported, but this makes adding others ''much'' easier.
 * Eliminate ORTE_SNAPC_CKPT_STATE_PENDING_TERM, since the new options type makes it redundant.
 * Lay some basic groundwork for future features.
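
As a rough illustration of why bundling the options into one type helps, here is a minimal sketch of what such an options structure could look like. Everything in it is an assumption made for illustration: the `sketch_` names, the field layout, and the rule that TERM takes precedence over STOP are hypothetical and are not taken from the actual {{{opal_crs_base_ckpt_options_t}}} definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for opal_crs_base_ckpt_options_t.
 * The real definition in the Open MPI source may differ. */
typedef struct {
    bool term; /* terminate the job after the checkpoint is taken */
    bool stop; /* send SIGSTOP to the job after the checkpoint is taken */
} sketch_ckpt_options_t;

/* Reset every option to "off" so callers only set what they need. */
static void sketch_ckpt_options_init(sketch_ckpt_options_t *opts)
{
    assert(opts != NULL);
    opts->term = false;
    opts->stop = false;
}

/* Describe what happens to the job after the checkpoint.
 * Assumption: TERM wins if both flags are set. */
static const char *sketch_ckpt_post_action(const sketch_ckpt_options_t *opts)
{
    if (opts->term) {
        return "terminate";
    }
    if (opts->stop) {
        return "stop";
    }
    return "continue";
}
```

With this shape, a new option is one more field in the structure rather than one more parameter on every function that passes checkpoint options along.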

This commit was SVN r21995.

The following SVN revision numbers were found above:
  r20391 --> open-mpi/ompi@0704b98668
2009-09-22 18:26:12 +00:00
help-opal-runtime.txt Remove some old references to ft_enable parameter that no longer exists. 2007-03-17 20:02:42 +00:00
Makefile.am Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). 2007-03-16 23:11:45 +00:00
opal_cr.c Add support for sending SIGSTOP to the MPI job after the checkpoint is taken (uses a BLCR feature for the option). 2009-09-22 18:26:12 +00:00
opal_cr.h Add support for sending SIGSTOP to the MPI job after the checkpoint is taken (uses a BLCR feature for the option). 2009-09-22 18:26:12 +00:00
opal_finalize.c Remove the filter framework - the xml support will have to be provided in a different manner that will be implemented shortly 2008-06-04 09:04:51 +00:00
opal_init.c Check return codes when init'ing the paffinity framework to avoid segfaulting 2009-08-26 01:58:15 +00:00
opal_params.c - Split the datatype engine into two parts: an MPI specific part in 2009-07-13 04:56:31 +00:00
opal_progress.c This is a very large change to rename several #define values from 2009-05-06 20:11:28 +00:00
opal_progress.h - Replace combinations of 2009-08-20 11:42:18 +00:00
opal.h Enable modex-less launch. Consists of: 2008-12-09 23:49:02 +00:00