Re: BLCR 0.6.2 beta1 now available

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Dec 18 2007 - 13:19:58 PST

    Yuan Tang wrote:
    > Hi Paul,
    >
    > Thank you for the work. I downloaded the beta version, installed it 
    > and tested it. The SIGSTOPed process can now get through the whole
    > checkpoint procedure. Congratulations! However, when I tried restarting the
    > previously checkpointed SIGSTOPed process from its disk image, the 
    > RESTART procedure never completed. It blocks in 
    > cr_rstrt_req.c:cr_rstrt_child(). I guess, if you move the 
    > send_sig_info(SIGSTOP, NULL, task) stuff to cr_rstrt_task_complete(), 
    > the whole procedure will complete normally. Hope it helps.
    
    I believe the current behavior is correct (or at least is what I 
    intended).  The process that was SIGSTOPed when the checkpoint was 
    requested is again SIGSTOPed when restarted.  To get it running again 
    you should be able to either send it a SIGCONT (which is tricky because 
    you might not know how soon to send it), or you can simply pass "--cont" 
    to cr_restart to have it done automatically.
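
The stop/continue behavior described above can be illustrated without BLCR at all. The following is only a minimal sketch using plain POSIX signals from Python: the child process is a hypothetical stand-in for a restarted task, and the function name stop_then_continue is invented for this example. It assumes that "--cont" amounts to delivering a SIGCONT once the restart is done; it does not call cr_restart or any BLCR API.

```python
import os
import signal
import subprocess
import sys


def stop_then_continue():
    """Stop a child with SIGSTOP, then resume it with SIGCONT.

    Returns (was_stopped, was_continued, exit_code).
    """
    # Hypothetical stand-in for a restarted process: sleeps briefly, exits 0.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(0.5)"]
    )

    # Put the child into the stopped state, as a restarted SIGSTOPed
    # process would be.
    os.kill(child.pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid report the stop instead of waiting for exit.
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    # The child makes no progress until someone sends SIGCONT.
    os.kill(child.pid, signal.SIGCONT)
    _, status = os.waitpid(child.pid, os.WCONTINUED)
    was_continued = os.WIFCONTINUED(status)

    # Now that it is running again, the child can finish normally.
    exit_code = child.wait()
    return was_stopped, was_continued, exit_code


if __name__ == "__main__":
    print(stop_then_continue())
```

In the sketch the parent knows exactly when the child has stopped because it waits for the stop notification first. With a real restart that moment is harder to observe from outside, which is why letting cr_restart deliver the signal itself is the simpler route.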
    
    If you find that adding "--cont" to the cr_restart arguments still 
    doesn't allow the restart to complete, let us know and we'll see if we 
    can figure out what is going on.
    
    -Paul
    
    >
    > Best wishes!
    >
    > Yuan Tang
    >
    > ----- Original Message ----
    > From: Paul H. Hargrove <PHHargrove_at_lbl_dot_gov>
    > To: checkpoint_at_lbl_dot_gov
    > Sent: Tuesday, December 18, 2007 4:23:30 AM
    > Subject: BLCR 0.6.2 beta1 now available
    >
    > The first beta of BLCR 0.6.2 is now available at
    > http://mantis.lbl.gov/blcr-dist/
    > Both source tarball and SRPM are available.  The filenames and MD5
    > checksums are:
    >   93249f20abd4eeec7a07db2f2a6cd2b2  blcr-0.6.2_b1.tar.gz
    >   e8ecba22c98de143ced20f83db76d8a1  blcr-0.6.2_b1-1.src.rpm
    >
    > This is a beta of a 0.6.2 patch release.  The intent of 0.6.2 is to fix
    > a small number of significant bugs found in 0.6.0 and 0.6.1 and to add
    > support for 2.6.23 kernels and some vendor-patched 2.6.22 kernels.  A
    > NEWS entry summarizing these changes appears below.
    >
    > You are receiving this e-mail either because you are subscribed to the
    > checkpoint_at_lbl_dot_gov mailing list or because you have reported one
    > of the bugs or previously unsupported kernel versions addressed by this
    > release.  I apologize if you receive multiple copies.
    >
    > I would greatly appreciate any feedback (positive or negative)
    > indicating if this beta fixes any problems you have reported with BLCR
    > 0.6.0 and/or 0.6.1.  Only after I have sufficient positive feedback will
    > I make 0.6.2 available for download from the main BLCR web pages.
    >
    > -Paul
    >
    >
    > 0.6.2_b1
    > --------
    > December 17, 2007
    > Bug-fix and expanded-support release.
    > - This release adds support for 2.6.23 kernels.
    > - This release adds support for SuSE's 2.6.22.x kernels.
    > - This release fixes a file descriptor leak that occurred on restart from
    >   a checkpoint-of-self requested via cr_request_checkpoint().
    > - This release fixes a deadlock (and unkillable process(es)) when a
    >   multi-threaded process aborts (or omits itself from) a checkpoint
    >   under certain conditions.
    > - This release fixes a restart-time failure when a checkpoint includes a
    >   pipe with one end outside the checkpoint scope, and data is buffered
    >   in the pipe.
    > - This release fixes a bug with the cr_request{,_file}() calls in which
    >   a failed checkpoint would cause failure of the next one if it had the
    >   same destination file name.
    > - This release fixes a race condition with the cr_enter_cs() and
    >   checkpoints in multi-threaded processes.
    > - This release fixes post-checkpoint signal delivery (--stop and friends)
    >   to occur after the checkpoint is fully completed.  See bug 2201 for
    >   a full description of the problems addressed by these changes.
    > - This release documents (and fully implements) signal-delivery options
    >   to cr_restart (see bug 2200).
    > - Adds test cases for most of the bugs fixed in this release.
    >
    >
    > -- 
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                  Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory    Fax: +1-510-486-6900
    >
    >
    >
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
