openmpi/opal/mca/crs/self
Josh Hursey 5406fdfb80 Add support for sending SIGSTOP to the MPI job after the checkpoint is taken (uses a BLCR feature for the option).
This commit looks larger than it really is since it includes a fair amount of code cleanup.

The SIGSTOP/SIGCONT+checkpointing work uses some of the functionality in r20391. A basic use case is shown below (note that the generated checkpoint is usable as usual if the stopped application is terminated).
{{{
shell 1) mpirun -np 2 -am ft-enable-cr my-app
... running ...

shell 2) ompi-checkpoint --stop -v MPIRUN_PID
[localhost:001300] [  0.00 /   0.20]                 Requested - ...
[localhost:001300] [  0.00 /   0.20]                   Pending - ...
[localhost:001300] [  0.01 /   0.21]                   Running - ...
[localhost:001300] [  1.01 /   1.22]                   Stopped - ompi_global_snapshot_1234.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_1234.ckpt

shell 2) killall -CONT mpirun

... Application Continues execution in shell 1 ...
}}}
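
The stop/continue half of the use case above relies on ordinary POSIX job-control signals. The following is a minimal sketch of that signal behavior only, not Open MPI or BLCR code: the function name `stop_and_continue_demo` is hypothetical, and a forked child that simply idles stands in for the MPI job.

```c
#define _XOPEN_SOURCE 700  /* expose kill(), WUNTRACED, WCONTINUED */
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): fork an idle child, stop it
 * with SIGSTOP, confirm that it stopped, resume it with SIGCONT, confirm
 * that it resumed, then clean up. Returns 0 on success, -1 on failure. */
int stop_and_continue_demo(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: stands in for the running application. */
        for (;;) {
            pause();
        }
    }

    int status = 0;
    int rc = -1;

    /* WUNTRACED reports the stop; WCONTINUED reports the resume. */
    if (kill(pid, SIGSTOP) == 0 &&
        waitpid(pid, &status, WUNTRACED) == pid && WIFSTOPPED(status) &&
        kill(pid, SIGCONT) == 0 &&
        waitpid(pid, &status, WCONTINUED) == pid && WIFCONTINUED(status)) {
        rc = 0;
    }

    /* Always terminate and reap the child so no process is left behind. */
    kill(pid, SIGKILL);
    waitpid(pid, &status, 0);
    return rc;
}
```

A caller can invoke `stop_and_continue_demo()` and treat a return value of 0 as confirmation that the child was observed in both the stopped and the continued state. Sending `SIGCONT` here plays the same role as `killall -CONT mpirun` in the transcript.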

Other items in this commit are mostly cleanup that has been sitting off-trunk for too long:
 * Add a new {{{opal_crs_base_ckpt_options_t}}} type that encapsulates the various options that could be passed to the CRS. Currently only TERM and STOP, but this makes adding others ''much'' easier.
 * Eliminate ORTE_SNAPC_CKPT_STATE_PENDING_TERM, since the new options type makes it redundant.
 * Lay some basic groundwork for future features.
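
As a rough illustration of why an options type is easier to extend than a dedicated state such as ORTE_SNAPC_CKPT_STATE_PENDING_TERM, here is a minimal sketch. Every name in it (`example_ckpt_options_t`, `example_after_ckpt_t`, `example_after_checkpoint`) is hypothetical; the real layout of `opal_crs_base_ckpt_options_t` is not shown in this commit message. It also assumes that TERM means terminating the job after the checkpoint and that TERM takes precedence over STOP when both are set.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the options type; field names are assumptions. */
typedef struct {
    bool term;  /* assumed: terminate the job after the checkpoint */
    bool stop;  /* assumed: send SIGSTOP to the job after the checkpoint */
} example_ckpt_options_t;

/* Hypothetical post-checkpoint actions derived from the options. */
typedef enum {
    EXAMPLE_AFTER_CONTINUE,  /* keep running as usual */
    EXAMPLE_AFTER_TERM,      /* terminate the job */
    EXAMPLE_AFTER_STOP       /* leave the job stopped until SIGCONT */
} example_after_ckpt_t;

/* Decide what to do once the checkpoint has been taken. A new option only
 * needs a new field and a new branch here, not a new checkpoint state.
 * The TERM-before-STOP precedence is an assumption of this sketch. */
example_after_ckpt_t example_after_checkpoint(const example_ckpt_options_t *opts)
{
    if (opts == NULL) {
        return EXAMPLE_AFTER_CONTINUE;
    }
    if (opts->term) {
        return EXAMPLE_AFTER_TERM;
    }
    if (opts->stop) {
        return EXAMPLE_AFTER_STOP;
    }
    return EXAMPLE_AFTER_CONTINUE;
}
```

With this shape, a request equivalent to `ompi-checkpoint --stop` would set only the `stop` field, and the caller would act on `EXAMPLE_AFTER_STOP`.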

This commit was SVN r21995.

The following SVN revision numbers were found above:
  r20391 --> open-mpi/ompi@0704b98668
2009-09-22 18:26:12 +00:00
configure.m4 This is a minor cleanup of the configure.m4 (per suggestion from Jeff). 2009-08-07 23:38:54 +00:00
configure.params Merging in the jjhursey-ft-cr-stable branch (r13912 : HEAD). 2007-03-16 23:11:45 +00:00
crs_self_component.c After talking through the patch with Jeff, we have a couple more fixes to r21766 that should also go over to v1.3 in Ticket #1987. 2009-08-05 22:07:37 +00:00
crs_self_module.c Add support for sending SIGSTOP to the MPI job after the checkpoint is taken (uses a BLCR feature for the option). 2009-09-22 18:26:12 +00:00
crs_self.h Add support for sending SIGSTOP to the MPI job after the checkpoint is taken (uses a BLCR feature for the option). 2009-09-22 18:26:12 +00:00
help-opal-crs-self.txt After talking through the patch with Jeff, we have a couple more fixes to r21766 that should also go over to v1.3 in Ticket #1987. 2009-08-05 22:07:37 +00:00
Makefile.am Per long threads on the mailing list and much confusion discussion 2007-12-15 13:32:02 +00:00