Re: Extending BLCR

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Sep 04 2007 - 15:14:57 PDT

  • Next message: Neal Becker: "blcr-0.6 dkms patch"
    Abhinav Jha wrote:
    > Dear Sir,
    >
    > Thank you for your kind reply. For the last few days, we have been going
    > through the BLCR code and are trying to figure out how a process is
    > checkpointed by BLCR. Is there a platform/forum where we could discuss
    > BLCR? We will try our best to club together our doubts in future so that
    > we don't cause you too much trouble ( we hope ).
    >   
    
    The address you are sending to (checkpoint_at_lbl_dot_gov) is a mailing list 
    including the BLCR developers and a few of the users.  It is the best 
    (and probably only) place to ask your questions.
    
    You can also find some explanation of how a process gets checkpointed in 
    the following two papers (also indexed on our website):
    
        * Duell, J., Hargrove, P., and Roman, E. "The Design and
          Implementation of Berkeley Lab's Linux Checkpoint/Restart."
          Berkeley Lab Technical Report (publication LBNL-54941)
          http://ftg.lbl.gov/CheckpointRestart/blcr.pdf
    
        * Paul H. Hargrove and Jason C. Duell. "Berkeley Lab
          Checkpoint/Restart (BLCR) for Linux Clusters." In Proceedings of
          SciDAC 2006: June 2006. (publication LBNL-60520)
          http://ftg.lbl.gov/CheckpointRestart/LBNL-60520.pdf
    
    
    -Paul
    
    > Thanks once again,
    >
    > Abhinav Jha & Manish Kumar,
    > Indian Institute of Technology Guwahati
    > Guwahati -39, INDIA
    > http://www.iitg.ernet.in
    >
    >
    >
    >   
    >> Abhinav Jha wrote:
    >>     
    >>> Dear Sir,
    >>>
    >>> We're final-year students from the Indian Institute of Technology
    >>> Guwahati (http://www.iitg.ernet.in), working on our B.Tech. project,
    >>> "Implementation of checkpoint and restart mechanism on the Linux kernel
    >>> 2.6".
    >>>       
    >> Thank you for your interest in BLCR.  You will find my answers to your
    >> questions below.
    >>
    >>     
    >>> We wanted to make use of the already existing facilities of BLCR in this
    >>> regard. However, we're not aware of a few things:
    >>>
    >>> 1. Whether we can change your code without violating your copyright.
    >>>       
    >> BLCR is distributed under 2 Open Source Software licenses, the GPL and
    >> LGPL.  You should examine the license.txt files in each directory for
    >> information on which license applies to the files in that directory.
    >>
    >> The GPL allows you to modify the covered portions of BLCR provided that
    >> you distribute your modified version under the same GPL license.
    >>
    >> The LGPL allows slightly more freedom in how you may use the covered
    >> portions of BLCR.
    >>
    >> In either case, you should not have any problems if this is only for a
    >> class project.  If you plan to distribute the resulting enhancements to
    >> the general public, you should expect to simply apply the same licenses
    >> to the modified versions.  You don't need to obtain any permissions from
    >> us to do so.  However, if you do develop enhancements of general
    >> interest, we should talk about incorporating your changes back into the
    >> base BLCR code.
    >>
    >>     
    >>> 2. What is the feasibility of implementing socket checkpointing in BLCR?
    >>>       
    >> Good question.  We have not tried to pursue this task ourselves, and
    >> therefore have not tried hard to determine the exact level of
    >> difficulty.  Assuming you are interested only in Unix-domain (aka
    >> AF_LOCAL) sockets, I imagine the problems are small since the buffered
    >> data is all local to one node.  In the case of TCP, you can probably get
    >> away with preserving only the data that is buffered locally (both
    >> incoming and outgoing) and counting on retransmission to recover any
    >> data "on the wire" at checkpoint time.  The difficulty, however, is
    >> likely to come from getting the TCP state engine back to the right
    >> state.  For UDP, you can probably do the same as TCP.
    >>
    >> If you also want to attempt migration of TCP or UDP sockets, then you
    >> will need some way to "adjust" the peer as well.
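
To illustrate the point about locally buffered data, here is a minimal user-space sketch in Python for a Unix-domain stream socket. It is not BLCR code and not how BLCR works internally: the helper names `snapshot_incoming` and `restore_incoming` are hypothetical, and the sketch assumes a Unix-like system (Linux in particular) where `MSG_PEEK` and `MSG_DONTWAIT` are available. It only shows that data queued on a local socket can be copied without being consumed and later replayed into a fresh socket pair.

```python
import socket


def snapshot_incoming(sock: socket.socket, max_bytes: int = 65536) -> bytes:
    """Copy up to max_bytes queued for reading on sock without consuming them.

    MSG_PEEK leaves the data in the receive queue; MSG_DONTWAIT makes the
    call return immediately when nothing is buffered.
    """
    try:
        return sock.recv(max_bytes, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    except BlockingIOError:
        return b""  # nothing buffered at "checkpoint" time


def restore_incoming(peer: socket.socket, saved: bytes) -> None:
    """Replay saved bytes by sending them from the peer end of a new pair."""
    if saved:
        peer.sendall(saved)


if __name__ == "__main__":
    # "Checkpoint": one end has written data the other has not yet read.
    writer, reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    writer.sendall(b"pending request")
    saved = snapshot_incoming(reader)

    # "Restart": build a new pair and refill the reader's queue.
    new_writer, new_reader = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    restore_incoming(new_writer, saved)
    print(new_reader.recv(64))
```

A kernel-level implementation would read the socket queues directly rather than going through `recv`, and TCP would additionally need its connection state restored, which this sketch does not attempt.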
    >>
    >>     
    >>> 3. Can we do an implementation of file checkpointing that is
    >>> independent of the one you have planned?
    >>>       
    >> We have code in the soon-to-be-released 0.6.0 version of BLCR that takes
    >> care of checkpointing of open-but-deleted files.  That code can easily be
    >> leveraged to checkpoint all open files, whether or not they are deleted.
    >> The interesting part comes at restart time when you need to determine
    >> whether to use the checkpointed copy of a file or the copy that now
    >> exists on disk.  Depending on how a given application uses files (and
    >> how users of the application expect to use the files after the
    >> application runs) there is no single correct policy.  The implementation
    >> work to be done here is certainly simpler than socket checkpointing.
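
The restart-time choice described above can be sketched as a small, pluggable policy. This is an illustrative Python sketch only, not part of BLCR: `RestartPolicy`, `FileRecord`, `record_file`, and `choose_source` are hypothetical names, and comparing size and modification time is just one assumed way to decide whether the on-disk file changed since the checkpoint.

```python
import os
from dataclasses import dataclass
from enum import Enum


class RestartPolicy(Enum):
    USE_CHECKPOINT = "use_checkpoint"        # always restore the saved copy
    USE_DISK = "use_disk"                    # trust whatever is on disk now
    DISK_IF_UNCHANGED = "disk_if_unchanged"  # disk copy only if it looks untouched


@dataclass
class FileRecord:
    """Metadata captured for one open file at checkpoint time."""
    path: str
    size: int
    mtime_ns: int


def record_file(path: str) -> FileRecord:
    """Capture the metadata needed to detect later changes to path."""
    st = os.stat(path)
    return FileRecord(path, st.st_size, st.st_mtime_ns)


def choose_source(rec: FileRecord, policy: RestartPolicy) -> str:
    """Return "checkpoint" or "disk" for the file described by rec."""
    try:
        st = os.stat(rec.path)
    except FileNotFoundError:
        return "checkpoint"  # nothing on disk, so only the saved copy exists
    if policy is RestartPolicy.USE_CHECKPOINT:
        return "checkpoint"
    if policy is RestartPolicy.USE_DISK:
        return "disk"
    unchanged = st.st_size == rec.size and st.st_mtime_ns == rec.mtime_ns
    return "disk" if unchanged else "checkpoint"
```

Because no single policy is correct for every application, the sketch leaves the choice to the caller, for example `choose_source(record_file("out.log"), RestartPolicy.DISK_IF_UNCHANGED)`.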
    >>
    >>     
    >>> 4. What would be a good way to go about reading/modifying the code,
    >>> since there is no manual available?
    >>>       
    >> I am afraid we don't have a good answer for this one.  We try to put
    >> comments in the kernel code that are sufficient for our own use when we
    >> look at code that another member of our group has written, or our code
    >> long after it was written.  However, it will take a good bit of time to
    >> learn the code just by reading it.  Alas, there is no documentation
    >> other than the code itself.
    >>
    >>     
    >>> We'll be very grateful to hear from you.
    >>>       
    >> Feel free to ask more questions if you need to.
    >>
    >>
    >>     
    >>> Thank you,
    >>>
    >>> Abhinav Jha & Manish Kumar,
    >>> Indian Institute of Technology Guwahati
    >>> Guwahati -39, INDIA
    >>> http://www.iitg.ernet.in
    >>>       
    >> -Paul
    >>
    >> --
    >> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >> Future Technologies Group
    >> HPC Research Department                   Tel: +1-510-495-2352
    >> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >>
    >>     
    >
    >
    >   
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
