Re: multiple checkpoints

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Mar 30 2005 - 12:58:05 PST

  • Next message: Richard Hu: "Re: multiple checkpoints"
    Richard,
    
      I don't have a definite answer, but I can guess.  I suspect that when 
    you see the hang, BLCR is trying to retire the first checkpoint before 
    starting the second.  When restarted from the 1st checkpoint, the 
    user-space part of BLCR believes that there is a previous checkpoint to 
    retire, but the kernel disagrees.
      I've entered a bug report at 
    http://mantis.lbl.gov/bugzilla/show_bug.cgi?id=1037
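
The guess above can be pictured with a toy model in plain C. This is not BLCR code and does not use libcr; every name in it (`user_state`, `kernel_state`, `take_checkpoint`, `CHECKPOINT_HANG`) is hypothetical. It assumes only what the guess states: the user-space flag is saved inside the checkpoint image, the kernel-side record is not, and a mismatch between the two shows up as a hang.

```c
#include <stdbool.h>

/* Hypothetical toy model of the suspected mismatch; not BLCR internals. */
struct user_state   { bool has_previous; };  /* assumed to be saved in the checkpoint image */
struct kernel_state { bool has_previous; };  /* assumed to start empty after a restart */

enum result { CHECKPOINT_OK, CHECKPOINT_HANG };

/* Take a checkpoint, retiring the previous one first if user space
 * believes one exists. */
static enum result take_checkpoint(struct user_state *u, struct kernel_state *k)
{
    if (u->has_previous) {
        if (!k->has_previous)
            return CHECKPOINT_HANG;  /* the two sides disagree: modeled as the hang */
        k->has_previous = false;     /* retire the previous checkpoint */
    }
    u->has_previous = true;          /* both sides now record this checkpoint */
    k->has_previous = true;
    return CHECKPOINT_OK;
}
```

In this model an uninterrupted run takes both checkpoints without trouble. A process restarted from the first image starts with `user_state.has_previous == true` and `kernel_state.has_previous == false`, so its next `take_checkpoint` call reports `CHECKPOINT_HANG`, which matches the behavior described in the report.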
    
    -Paul
    
    Richard Hu wrote:
    
    > To Whom It May Concern:
    >
    > I appear to be having an issue with multiple checkpoints in BLCR and I 
    > was wondering if you could perhaps shed some light on the problem.  I 
    > have attached a simple test program to demonstrate my problem.  
    > Essentially when I run the program, two checkpoints are generated with 
    > some activity happening between the checkpoints.  When I restart from 
    > the second checkpoint (for_loop_1), everything works.  When I restart 
    > from the first checkpoint (for_loop_0), the program hangs when it hits 
    > the spot in the program where it attempts to create the second 
    > checkpoint.  Do you know why this happens?  Is there a possible 
    > work-around?
    >
    > Thanks,
    > Richard Hu
    > rhu_at_opnet_dot_com
    >
    >------------------------------------------------------------------------
    >
    >#include <stdio.h>
    >#include <stdlib.h>
    >#include "libcr.h"
    >#include <math.h>
    >#include <string.h>
    >
    >int callback(void *arg);
    >
    >int main (void) {
    >  int counter;
    >  char path[100] = "/usr/local/for_loop_";
    >  char num[20];
    >  
    >  cr_init();
    >  cr_register_callback(callback, NULL, CR_THREAD_CONTEXT);
    >  counter = 0;
    >
    >  for (counter = 0; counter < 20; counter++)
    >    printf("I am number %i\n", counter);
    >
    >  cr_request_file ("/usr/local/for_loop_0");
    >
    >  for (counter = 40; counter < 60; counter++)
    >    printf("I am number %i\n", counter);
    >
    >  cr_request_file ("/usr/local/for_loop_1");
    >
    >  return 0;
    >}
    >
    >    
    >int callback (void* arg) {
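    >  int rc;
    >  
    >  /* cr_checkpoint() returns 0 when continuing, a positive value on
    >     restart, and a negative value on failure */
    >  rc = cr_checkpoint(CR_CHECKPOINT_READY);
    >  if (rc > 0) {
    >    printf("We have been restarted\n");
    >  }
    >  else if (rc == 0) {
    >    printf("Dump generated.  We are continuing\n");
    >  }
    >  else {
    >    printf("Checkpoint failed\n");
    >  }
    >  return 0;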
    >}
    >  
    >
    
