Re: multiple checkpoints

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 30 2005 - 13:58:59 PST

  • Next message: Kris Buggenhout: "chekpoint support for amd64?"
    Richard,
    
    The BLCR kernel code can only handle a single checkpoint outstanding per 
    target process.  Note that the request is somewhat asynchronous and the 
    first cr_request_file() may return while the thread-context callback is 
    still actually running, though the process will be stopped when the 
    callback invokes cr_checkpoint().  The second call to cr_request_file() 
    must therefore wait for the first checkpoint to actually complete before 
    the second call can begin to take a checkpoint.  At the same time it 
    will also reap the completion code of the previous request (similar to 
    waitpid() for processes).
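
The waitpid() analogy can be sketched in plain C.  This is only a toy model of the "one outstanding request" rule, not BLCR code: model_request(), model_reap() and the pending variable are hypothetical names made up for illustration, and a forked child stands in for the asynchronous checkpoint.  A new request first waits for the previous one and collects its completion code, then starts its own work and returns without waiting for it.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical model only; none of these names are part of libcr. */
static pid_t pending = -1;  /* worker for the request still in flight, or -1 */

/* Wait for the in-flight request (if any) and return its completion code.
 * Returns -1 when there is nothing to reap or the worker ended abnormally. */
static int model_reap(void)
{
    int status;

    if (pending < 0)
        return -1;
    if (waitpid(pending, &status, 0) < 0)
        return -1;
    pending = -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Start a new request whose worker will finish with completion code `code`.
 * Any earlier request is waited for and reaped first; its completion code
 * is stored in *prev_code (-1 if there was none).
 * Returns 0 on success, -1 if the worker could not be started. */
static int model_request(int code, int *prev_code)
{
    pid_t pid;

    assert(prev_code != NULL);
    *prev_code = model_reap();   /* blocks until the earlier request is done */

    fflush(NULL);                /* do not duplicate buffered output in the child */
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);             /* stand-in for the actual checkpoint work */

    pending = pid;               /* returns while the worker may still be running */
    return 0;
}
```

Calling model_request() twice in a row shows the pattern: the first call returns with nothing to reap, and the second call blocks until the first worker has exited and hands back its code.  A final model_reap() collects the last request.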
    
    -Paul
    
    Richard Hu wrote:
    
    > Thank you for your response.
    >
    > What exactly do you mean by retire?  I'm taking it to mean that 
    > there's some kind of clean-up that needs to be done before restarting 
    > from a checkpoint.  Is that a correct interpretation?  Also, I wanted 
    > to add that this behavior doesn't seem to happen when there is quite a 
    > bit of activity between the two checkpoints.  For example, if the two 
    > checkpoints in the sample program were separated by thousands of lines 
    > of code doing complex calculations, I don't have this problem.  I 
    > assume this has to do with BLCR being able to retire the first checkpoint 
    > before starting the second, right?
    >
    > Thanks,
    > Richard
    >
    > At 03:58 PM 3/30/2005, you wrote:
    >
    >> Richard,
    >>
    >>  I don't have a certain answer, but I can guess.  I suspect that when 
    >> you see the hang BLCR is trying to retire the first checkpoint before 
    >> starting the second.  When restarted from the 1st checkpoint the user 
    >> space part of BLCR believes that there is a previous checkpoint to 
    >> retire, but the kernel disagrees.
    >>  I've entered a bug report at 
    >> http://mantis.lbl.gov/bugzilla/show_bug.cgi?id=1037
    >>
    >> -Paul
    >>
    >> Richard Hu wrote:
    >>
    >>> To Whom It May Concern:
    >>>
    >>> I appear to be having an issue with multiple checkpoints in BLCR and 
    >>> I was wondering if you could perhaps shed some light on the 
    >>> problem.  I have attached a simple test program to demonstrate my 
    >>> problem.
    >>> Essentially when I run the program, two checkpoints are generated 
    >>> with some activity happening between the checkpoints.  When I 
    >>> restart from the second checkpoint (for_loop_1), everything works.  
    >>> When I restart from the first checkpoint (for_loop_0), the program 
    >>> hangs when it hits the spot in the program where it attempts to 
    >>> create the second checkpoint.  Do you know why this happens?  Is 
    >>> there a possible work-around?
    >>>
    >>> Thanks,
    >>> Richard Hu
    >>> rhu_at_opnet_dot_com
    >>>
    >>> ------------------------------------------------------------------------ 
    >>>
    >>>
    >>> #include <stdio.h>
    >>> #include <stdlib.h>
    >>> #include "libcr.h"
    >>> #include <math.h>
    >>> #include <string.h>
    >>>
    >>> int callback(void *arg);
    >>>
    >>> int main (void) {
    >>>  int counter;
    >>>  char path[100] = "/usr/local/for_loop_";
    >>>  char num[20];
    >>>
    >>>  cr_init();
    >>>  cr_register_callback(callback, NULL, CR_THREAD_CONTEXT);
    >>>  counter = 0;
    >>>
    >>>  for (counter = 0; counter < 20; counter++)
    >>>    printf("I am number %i\n", counter);
    >>>
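    >>>  /* First checkpoint request */
    >>>  cr_request_file ("/usr/local/for_loop_0");
    >>>
    >>>  for (counter = 40; counter < 60; counter++)
    >>>    printf("I am number %i\n", counter);
    >>>
    >>>  /* Second checkpoint request; the reported hang occurs here after
    >>>     restarting from for_loop_0 */
    >>>  cr_request_file ("/usr/local/for_loop_1");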
    >>>
    >>>  return 0;
    >>> }
    >>>
    >>>
    >>> int callback (void* arg) {
    >>>  int rc;
    >>>
    >>>  rc = cr_checkpoint(CR_CHECKPOINT_READY);
    >>>  if (rc) {
    >>>    printf("We have been restarted\n");
    >>>  }
    >>>  else {
    >>>    printf("Dump generated.  We are continuing\n");
    >>>  }
    >>>  return 0;
    >>> }
    >>>
    >>
    >
    
