Re: cr_restart: ->cri_syscall(CR_OP_RSTRT_REAP): Invalid argument

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Oct 11 2005 - 10:28:12 PDT

  • Next message: Neal Becker: "Re: BLCR 0.4.1 Beta5 now available"
    Replies appear below.
    
    Christian Iwainsky wrote:
    > Hello,
    > I have a problem with BLCR.
    > I have written a distributed program, which is successfully checkpointed.
    > But once I try to restart the second instance of the program on one
    > machine, cr_restart aborts with:
    > cri_syscall(CR_OP_RSTRT_REAP): Invalid argument
    >
    > in /var/log/messages:
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    > Oct 11 14:15:40 faui21l kernel: vmadump: invalid signature
    > Oct 11 14:15:40 faui21l kernel: thaw_threads returned error, aborting. -22
    >
    > What is the problem? (The Pid is free)
    
    The "invalid signature" means the context file you are trying to restart
    from is either corrupted or possibly truncated.  I suspect that you have
    not successfully checkpointed, but that the checkpoint operation has
    failed without letting you know.  Is it possible that multiple processes
    might have been writing their checkpoints to the *same* file?  That
    would certainly result in a corrupted file.
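
One way to rule out such a collision is to give every process its own
context file name. The following is only a minimal sketch under stated
assumptions: make_context_filename and the "context" prefix are
hypothetical names for illustration, not part of BLCR, and the resulting
name would then be handed to whatever call your program uses to request
the checkpoint.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper (not part of BLCR): build a context-file name that
 * is unique per host and per process, of the form <prefix>.<host>.<pid>.
 * Returns 0 on success, -1 if the host name cannot be read or the
 * resulting name does not fit in buf. */
static int make_context_filename(char *buf, size_t len, const char *prefix)
{
    char host[256];
    int n;

    assert(buf != NULL && prefix != NULL && len > 0);

    if (gethostname(host, sizeof(host)) != 0)
        return -1;
    /* gethostname() may leave the name unterminated if it was truncated. */
    host[sizeof(host) - 1] = '\0';

    n = snprintf(buf, len, "%s.%s.%ld", prefix, host, (long)getpid());
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Including the host name as well as the pid matters when several machines
write to a shared directory, since pids are only unique per machine.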
    >
    > I also experience an interesting behaviour:
    > I use the following code for the checkpoint-callback:
    > dsm_checkpoint_ready is initialized to 0
    >
    > /***********************************************************/
    > int chkpt_callback(void * aptr){
    > fprintf(stderr,"chkpt_callback\n");
    > if (!dsm_checkpoint_ready){
    >   // the checkpoint thread function is asleep ... don't checkpoint yet,
    >   // but awaken the checkpoint thread
    >   dsm_checkpoint_sleep=0;
    >   // postpone the checkpoint till jackal has a consistent state
    >   fprintf(stderr,"Postponing checkpoint ..\n");
    >   //cr_checkpoint(CR_CHECKPOINT_READY);
    >   cr_checkpoint(CR_CHECKPOINT_TEMP_FAILURE);
    >   return 0;
    > }
    > fprintf(stderr,"checkpoint callback: taking checkpoint\n");
    > int chkptResult=cr_checkpoint(CR_CHECKPOINT_READY);
    > if (chkptResult>0){
    >   fprintf(stderr,"Restarting ...\n");
    >   dsm_checkpoint_wakeup=1;
    > } else if (chkptResult==0){
    >   fprintf(stderr,"checkpointing ........\n");
    > }else {
    >   fprintf(stderr,"Checkpoint Failure\n");
    >   cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE);
    >   return -1;
    > }
    > return 0;
    > }
    >
    > Once the callback has postponed the checkpoint, the program is brought
    > to a state suitable for checkpointing, and then cr_request_file is
    > called to do the real checkpoint.
    > The program crashes on the call to cr_request_file:
    >
    
    It is not clear to me from your description how cr_request_file might be
    crashing.  I don't see anything wrong with your example except for your
    call to "cr_checkpoint(CR_CHECKPOINT_PERM_FAILURE)" in case of an error
    (you should just return -1, rather than calling cr_checkpoint a 2nd
    time).  However, since that code will only run if something is already
    "broken", I don't think it is the immediate cause of your problem.
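
To make the suggested control flow concrete, here is a rough sketch of
the callback body with the second call removed. So that it compiles
without BLCR installed, the checkpoint call is passed in as a function
pointer; callback_body, checkpoint_fn, the SKETCH_* flags and
failing_checkpoint are all hypothetical names for this illustration.
Real code would call cr_checkpoint with the CR_CHECKPOINT_* constants
directly, and the return-value convention assumed here is simply the
one your callback already relies on.

```c
#include <stdio.h>

/* Placeholder flags for this sketch only; real code would use the
 * CR_CHECKPOINT_* constants from BLCR instead. */
enum { SKETCH_READY = 0, SKETCH_TEMP_FAILURE = 1 };

/* Assumed to mirror how the quoted callback uses cr_checkpoint():
 * >0 on restart, 0 when continuing after the checkpoint, <0 on failure. */
typedef int (*checkpoint_fn)(int flags);

/* Callback logic with exactly one checkpoint call on every path. */
static int callback_body(int ready, checkpoint_fn do_checkpoint)
{
    int rc;

    if (!ready) {
        /* Not in a consistent state yet: postpone this checkpoint. */
        do_checkpoint(SKETCH_TEMP_FAILURE);
        return 0;
    }

    rc = do_checkpoint(SKETCH_READY);
    if (rc < 0) {
        fprintf(stderr, "Checkpoint failure\n");
        return -1;              /* just report the error, no second call */
    }
    if (rc > 0)
        fprintf(stderr, "Restarting ...\n");
    return 0;
}

/* Test double standing in for a checkpoint call that always fails. */
static int calls = 0;
static int failing_checkpoint(int flags)
{
    (void)flags;
    calls++;
    return -1;
}
```

Calling callback_body(1, failing_checkpoint) returns -1 after a single
invocation of the checkpoint function, which is the behaviour described
above.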
    
    Is it possible for you to send a stack backtrace from a core file
    generated by this failure?  I could then get a better idea of what is
    wrong inside cr_request_file.
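
If no core file is being written at all, the core size limit of the
process may be zero. As a hedged sketch (enable_core_dumps is a
hypothetical helper name, and whether a core file actually appears also
depends on system settings outside the program), the soft limit can be
raised to the hard limit early in main:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
 * limit so that a crash can leave a core file behind.
 * Returns 0 on success, -1 on error. */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;  /* the soft limit may not exceed the hard limit */
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

After the crash, loading the program together with the core file into
gdb and issuing "bt" should give the backtrace.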
    >
    > Regards,
    > Christian
    
    -Paul
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
