Re: API for checkpoint

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Aug 01 2007 - 14:09:53 PDT

    Neal,
      Your code below is (as best as I can tell w/o running it) correct.  
    However, depending on the specific return values -1 and -2 from 
    cr_poll_checkpoint() is probably unsafe.  I'll see about adding CR_* 
    constants to replace the explicit -1 and -2 values in the next beta 
    release.  Once added, you'll find their descriptions in libcr.h and a 
    usage example in cr_checkpoint.c.
      You are correct that the checkpoint is not necessarily flushed to disk 
    when the poll call succeeds.  You'll have to make your own fsync() call 
    if you require that guarantee.
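
    To make the fsync() point concrete, here is a minimal sketch using
    only POSIX calls (no libcr).  The helper name flush_and_publish() is
    made up for illustration; it assumes the checkpoint has already been
    written to fd and that you want the data on disk before renaming the
    file into place.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: flush an already-written file to stable storage,
 * close it, then rename it to its final name.
 * Returns 0 on success, -1 on failure with errno set. */
int flush_and_publish(int fd, const char *tmp_path, const char *final_path)
{
    assert(tmp_path != NULL && final_path != NULL);

    /* fsync() blocks until the file's data has reached the device. */
    if (fsync(fd) != 0) {
        int saved = errno;      /* keep the fsync() error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* Only now make the file visible under its final name. */
    return rename(tmp_path, final_path);
}
```

    In your code this would take the place of the final rename() call,
    with newfd as the first argument.  If the directory entry itself must
    survive a crash, an fsync() on the containing directory may also be
    needed.
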
      Feel free to ask if you need any more clarifications.
      Below I've listed some of the advantages of this new API over the 
    previous one.
    
    -Paul
    
    1)  This interface allows for a bounded wait.  (Despite the loop around 
    the cr_poll_checkpoint() call, the common case is a single trip through 
    it.  The 2nd arg to cr_poll_checkpoint() is a (struct timeval *) like 
    the final argument to select().  Thus the NULL value here means to wait 
    forever (or until interrupted by a signal), while a non-NULL 2nd arg 
    would allow you to perform a bounded wait.)
    2) This interface allows for multi-process scopes (tree, pgrp, session).
    3) This interface allows for checkpointing something other than oneself.
    4) This interface allows multiple checkpoints (of distinct targets) to 
    be in-flight simultaneously.
    5) This interface returns error codes when something goes wrong.
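
    To illustrate the timeout convention mentioned in item 1, here is a
    small sketch using select() itself rather than libcr, so it can be
    compiled anywhere.  The helper name wait_readable() is made up for
    illustration.  Passing a (struct timeval *) as the 2nd arg of
    cr_poll_checkpoint() is described above as working the same way, but
    this sketch does not call it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable.
 *   timeout == NULL : wait forever (or until interrupted by a signal)
 *   timeout != NULL : give up once the given interval has elapsed
 * Returns 1 if fd is readable, 0 on timeout, -1 on error (errno set). */
int wait_readable(int fd, struct timeval *timeout)
{
    fd_set readfds;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    /* select() returns the number of ready descriptors, 0 on timeout,
     * or -1 on error; with a single descriptor that is 1, 0, or -1. */
    return select(fd + 1, &readfds, NULL, NULL, timeout);
}
```

    A caller wanting a bounded wait fills in a struct timeval (for
    example tv_sec = 5, tv_usec = 0) and passes its address.  Since
    select() may modify the timeval on some systems, reinitialize it
    before each call.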
    
    Neal Becker wrote:
    > Based on studying the code in cr_checkpoint.c, I have come up with the 
    > following.  Any comments appreciated.  I'm guessing that the call to 
    > cr_request_checkpoint, followed by the cr_poll_checkpoint, will efficiently 
    > do the checkpoint and then wait for it to complete (but not necessarily be 
    > flushed to the disk).
    >
    > static void doit () {
    >     
    >   int newfd = creat (newname.c_str(), 0600);
    >   if (newfd < 0)
    >     die ("creat failed");
    >
    >   cr_args.cr_fd = newfd;
    >   cr_args.cr_scope = ...
    >
    >
    >   cr_checkpoint_handle_t cr_handle;
    >
    >   int err = cr_request_checkpoint (&cr_args, &cr_handle);
    >   if (err < 0)
    >     die ("cr_request_checkpoint failed");
    >
    >   do {
    >     int err = cr_poll_checkpoint (&cr_handle, NULL);
    >     if (err < 0) {
    >       if (errno == EINVAL) {
    >         return;                 // restarted
    >       }
    >       else if (errno == EINTR) {
    >         ;
    >       }
    >       else {
    >         perror ("cr_poll_checkpoint");
    >         break;
    >       }
    >     }
    >     else if (err == 0) {
    >       die ("cr_poll_checkpoint returned unexpected 0");
    >     }
    >   } while (err < 0);
    >
    >   if (err == -1) {
    >     die (std::string ("cr_poll_checkpoint") + strerror (err));
    >   }
    >   else if (err == -2) {
    >     if (err == CR_ETEMPFAIL) {
    >       die("Checkpoint cancelled by application: try again later\n");
    >     } else if (err == ESRCH) {
    >       die("Checkpoint failed: no processes checkpointed\n");
    >     } else if (err == CR_EPERMFAIL) {
    >       die("Checkpoint cancelled by application: unable to checkpoint\n");
    >     } else if (err == CR_ENOSUPPORT) {
    >       die("Checkpoint failed: support missing from application\n");
    >     } else {
    >       die(std::string ("ioctl") + strerror (err));
    >     }
    >   }
    >   else if (err < 0) {
    >     die(std::string ("cr_poll_checkpoint") + strerror (err));
    >   }
    >
    >   if (rename (newname.c_str(), name.c_str()) != 0)
    >     die ("rename failed");
    > }
    >   
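
    The retry-on-EINTR part of the quoted loop can be sketched in
    isolation as below.  This is not libcr code: poll_fn is a made-up
    callback type standing in for a blocking poll call, and its return
    convention (negative with errno set on failure) is an assumption for
    the sketch.  The result is kept in one variable declared outside the
    loop, so the loop condition tests the value from the most recent call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a blocking poll call: returns a negative
 * value with errno set on failure, anything else otherwise. */
typedef int (*poll_fn)(void *ctx);

/* Call fn until it returns something other than an EINTR failure. */
int poll_retrying_eintr(poll_fn fn, void *ctx)
{
    int rc;   /* single result variable, visible to the loop condition */

    assert(fn != NULL);
    do {
        rc = fn(ctx);
    } while (rc < 0 && errno == EINTR);
    return rc;
}

/* Demonstration stub: fails with EINTR until the counter reaches 0,
 * then reports success. */
int interrupted_n_times(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        errno = EINTR;
        return -1;
    }
    return 1;
}
```

    With a counter of 3, poll_retrying_eintr(interrupted_n_times, &n)
    makes four calls and returns the stub's success value.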
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
