Checkpointing is a facility to save the complete status of an executing
       program or job and to restore and restart from this so-called checkpoint
       at a later point in time if the original program or job was halted,
       e.g. through a system crash.

       Univa Grid Engine provides various levels of checkpointing support (see
       sge_ckpt(1)).  The checkpointing environment described here is a  means
       to configure the different types of checkpointing in use for your Univa
       Grid Engine cluster or parts thereof. For that purpose you can define
       the operations which have to be executed to initiate a checkpoint
       generation, to migrate a checkpoint to another host or to restart a
       checkpointed application, as well as the list of queues which are
       eligible for a checkpointing method.

       Supporting different operating systems may force Univa Grid Engine to
       introduce operating system dependencies into the checkpointing
       configuration file, and updates of the supported operating system
       versions may lead to frequently changing implementation details.
       Please refer to the <sge_root>/ckpt directory for more information.

       Please use the -ackpt, -dckpt, -mckpt or -sckpt options to the qconf(1)
       command to manipulate checkpointing environments from the  command-line
       or  use the corresponding qmon(1) dialogue for X-Windows based interac-
       tive configuration.
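
       As a minimal sketch of how the four qconf(1) options above might be
       driven from a script, the following Python helper only builds the
       argument list and does not run anything. The action names (add,
       delete, modify, show) and the helper name qconf_ckpt_argv are
       illustrative assumptions, as is the environment name my_ckpt; see
       qconf(1) for the authoritative description of each option.

```python
# Hypothetical mapping from an action name to the qconf(1) option named in
# the text. The action names are an assumption made for this sketch.
QCONF_CKPT_OPTIONS = {
    "add": "-ackpt",
    "delete": "-dckpt",
    "modify": "-mckpt",
    "show": "-sckpt",
}


def qconf_ckpt_argv(action, ckpt_name):
    """Build (but do not execute) a qconf argument list.

    Raises KeyError if the action is not one of the four known names.
    """
    return ["qconf", QCONF_CKPT_OPTIONS[action], ckpt_name]


if __name__ == "__main__":
    # "my_ckpt" is an invented environment name.
    print(" ".join(qconf_ckpt_argv("show", "my_ckpt")))
```

       The returned list can be handed to subprocess.run() on a host where
       qconf is installed; building the list separately keeps the sketch
       usable without a running cluster.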

       Note that Univa Grid Engine allows a backslash (\) to be used to escape
       a newline (\newline) character. The backslash and the newline are
       replaced with a space (" ") character before any interpretation.
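
       The line continuation rule can be illustrated with a short Python
       sketch. It is not the parser used by Univa Grid Engine; it assumes
       that each logical line holds an attribute name followed by
       whitespace and a value, and the sample environment name and command
       path are invented for the example.

```python
def join_continuations(text):
    """Replace every backslash-newline pair with a single space."""
    return text.replace("\\\n", " ")


def parse_attributes(text):
    """Split the joined text into name/value pairs.

    Assumption: one attribute per logical line, name and value separated
    by whitespace. Blank lines are skipped.
    """
    attributes = {}
    for line in join_continuations(text).splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        attributes[name] = value
    return attributes


# Hypothetical sample; the second attribute is continued on the next line.
SAMPLE = """ckpt_name my_ckpt
restart_command /path/to/restart.sh \\
    --verbose
"""

if __name__ == "__main__":
    for name, value in parse_attributes(SAMPLE).items():
        print(name, "=", value)
```

       Because the replacement happens before any interpretation, the
       continued value is seen as one logical line by the rest of the
       parsing code.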

       The format of a checkpoint file is defined as follows:

       The name of the checkpointing environment as defined for ckpt_name in
       sge_types(1). It is used with the qsub(1) -ckpt switch or with the
       qconf(1) options mentioned above.

       The type of checkpointing to be used. Currently,  the  following  types
       are valid:

              The Hibernator kernel level checkpointing is interfaced.

       cpr    The SGI kernel level checkpointing is used.

              The Cray kernel level checkpointing is assumed.

              restart_command (see below), which is not used (even  if  it  is
              configured)  but  the job script is invoked in case of a restart

       A command-line type command string to be executed by Univa Grid  Engine
       in order to initiate a checkpoint.

       A  command-line type command string to be executed by Univa Grid Engine
       during a migration of a checkpointing job from one host to another.

       A command-line type command string to be executed by Univa Grid  Engine
       when restarting a previously checkpointed application.

       A command-line type command string to be executed by Univa Grid Engine
       in order to clean up after a checkpointed application has finished.

       A file system location in which checkpoints of potentially considerable
       size are to be stored.

       A  Unix  signal  to be sent to a job by Univa Grid Engine to initiate a
       checkpoint generation. The value for this field can either  be  a  sym-
       bolic  name from the list produced by the -l option of the kill(1) com-
       mand or an integer number which must be a valid signal on  the  systems
       used for checkpointing.
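
       A value of this kind could be checked along the following lines. The
       sketch uses Python's standard signal module and accepts either a
       symbolic name (with or without the SIG prefix, which is an assumption
       about how names are written) or a decimal number. It validates
       against the signals known to the host running the script, which is
       not necessarily the host on which the checkpoint is taken.

```python
import signal


def resolve_signal(value):
    """Return the signal number for a symbolic name or a decimal string.

    Raises ValueError if the value is not a valid signal on this system.
    """
    text = value.strip()
    if text.isdigit():
        # signal.Signals() raises ValueError for an unknown number.
        return int(signal.Signals(int(text)))
    name = text.upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError("unknown signal name: %r" % value) from None


if __name__ == "__main__":
    for candidate in ("USR1", "SIGTERM", "9"):
        print(candidate, "->", resolve_signal(candidate))
```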

       The points in time at which checkpoints are expected to be generated.
       Valid values for this parameter consist of the letters s, m, x and r,
       in any combination and without any separating character in between.
       The same letters are allowed for the -c option of the qsub(1) command,
       which overrides the definitions in the checkpointing environment in
       use. The meaning of the letters is defined as follows:

       s      A job is checkpointed, aborted and if possible migrated  if  the
              corresponding sge_execd(8) is shut down on the job's machine.

       m      Checkpoints are generated periodically at the min_cpu_interval
              interval defined by the queue (see queue_conf(5)) in which a job
              runs.

       x      A  job is checkpointed, aborted and if possible migrated as soon
              as the job gets suspended (manually as well as automatically).

       r      A job will be rescheduled (not checkpointed) when the host on
              which the job currently runs has gone into an unknown state and
              the time interval reschedule_unknown (see sge_conf(5)) defined in
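
       The letter combinations described above can be validated with a few
       lines of Python. This is only a sketch: the function names are
       hypothetical, rejecting an empty string is an assumption, and only
       the four letters listed in this page are accepted.

```python
VALID_WHEN_LETTERS = frozenset("smxr")


def parse_when(value):
    """Return the set of occasion letters contained in the given string.

    Raises ValueError for an empty string (an assumption of this sketch)
    or for any character other than s, m, x and r, including separators.
    """
    if not value:
        raise ValueError("empty occasion string")
    invalid = sorted(set(value) - VALID_WHEN_LETTERS)
    if invalid:
        raise ValueError("invalid occasion letters: %s" % "".join(invalid))
    return set(value)


def effective_when(env_value, qsub_c_value=None):
    """Apply the rule that a qsub -c value replaces the environment's value."""
    if qsub_c_value is not None:
        return parse_when(qsub_c_value)
    return parse_when(env_value)


if __name__ == "__main__":
    print(sorted(effective_when("sx")))
    print(sorted(effective_when("sx", "m")))
```

       Returning a set reflects that the letters may be combined in any
       order, so "sx" and "xs" describe the same configuration.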

       means to detect this.

       sge_intro(1), sge_ckpt(1), sge_types(1), qconf(1), qmod(1), qsub(1),

       See sge_intro(1) for a full statement of rights and permissions.

UGE 8.0.0                $Date: 2007/02/14 12:58:39 $            CHECKPOINT(5)
