Re: Asynchronous checkpointing support in BLCR

From: Josh Hursey (jjhursey_at_open-mpi.org)
Date: Tue Feb 27 2007 - 06:44:56 PST

    On Feb 27, 2007, at 8:15 AM, Rajagopal Natarajan wrote:
    
    > Hi,
    >
    > I'm working on a 10 node P3 cluster, and use BLCR on it. I would  
    > like to know if BLCR has any existing support for asynchronous  
    > checkpointing.
    
    What do you mean by "asynchronous checkpointing"?
    
    BLCR supports the command line tools cr_checkpoint and cr_restart,
    which checkpoint and restart an application that is properly linked
    with BLCR. The application does not have to add any code in order to
    be supported. So you could call that asynchronous checkpointing (and
    some do).
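
As a rough illustration of that workflow, here is a minimal Python sketch that only builds the two command lines and does not run them. The helper names and the PID are hypothetical, and the context.<pid> default file name is an assumption about cr_checkpoint's behavior when no output file is given; check the User's Guide for your BLCR version.

```python
import shlex


def checkpoint_command(pid):
    """Command line asking BLCR to checkpoint the process with this PID."""
    return ["cr_checkpoint", str(pid)]


def default_context_file(pid):
    """File name cr_checkpoint is ASSUMED to write when no file is given."""
    return f"context.{pid}"


def restart_command(context_file):
    """Command line asking BLCR to restart from a saved context file."""
    return ["cr_restart", context_file]


if __name__ == "__main__":
    pid = 12345  # hypothetical PID of an application linked with BLCR
    print(shlex.join(checkpoint_command(pid)))
    print(shlex.join(restart_command(default_context_file(pid))))
```

Nothing in the application itself changes; the request comes from outside the process, which is the sense of "asynchronous" used here.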
    
    If by "asynchronous checkpointing" you mean using an Uncoordinated
    Checkpoint/Restart Coordination Protocol, that is a bit higher level
    than BLCR, since it implicitly requires knowledge of a multi-process
    environment in which processes may or may not be located on the same
    machine. For this you need to look at building on top of the existing
    BLCR infrastructure in something like an MPI implementation, as you
    note below.
    
    >
    > If the answer is yes, please point me to the appropriate docs.
    
    I'd start with the User's Guide: :)
    http://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Users_Guide.html
    
    >
    > If the answer is no, I would like to implement asynchronous  
    > checkpointing in LAM-MPI.
    
    LAM/MPI already incorporates an asynchronous checkpointing feature,
    meaning command line tools are exposed so you can checkpoint an MPI
    program with BLCR without modifying the MPI program. LAM/MPI uses a
    Coordinated Checkpoint/Restart Coordination Protocol, and supports
    checkpointing with TCP and GM (Myrinet).
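
To show what "coordinated" means in general terms, here is a toy Python sketch. It is not LAM/MPI's implementation; the Process class and coordinated_checkpoint function are hypothetical. The idea it illustrates is that every process first drains the messages still in flight to it, and only then does each process save its state, so no saved state depends on a message that is missing from the checkpoint.

```python
from collections import deque


class Process:
    """Toy process: an inbox of in-flight messages plus delivered state."""

    def __init__(self, rank):
        self.rank = rank
        self.inbox = deque()   # messages sent to us but not yet delivered
        self.delivered = []    # stands in for the application's state

    def send(self, other, payload):
        other.inbox.append((self.rank, payload))

    def drain(self):
        # Deliver everything that is still in flight to this process.
        while self.inbox:
            self.delivered.append(self.inbox.popleft())


def coordinated_checkpoint(processes):
    """Drain every channel first, then snapshot every process."""
    for p in processes:
        p.drain()
    return {p.rank: list(p.delivered) for p in processes}


if __name__ == "__main__":
    a, b = Process(0), Process(1)
    a.send(b, "hello")
    print(coordinated_checkpoint([a, b]))
```

Because all processes checkpoint together against empty channels, the set of checkpoints is consistent by construction and a restart never has to roll anyone back further.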
    
    > Please tell me if I can make use of BLCR and modify the code to do
    > that, and how much code might need to be modified. Would it be
    > feasible to implement it in 1-1.5 months, with two developers
    > working part time on it (myself and my classmate, who are both
    > working on our bachelor's thesis on checkpointing in LAM-MPI based
    > clusters and avoidance of rollback propagation)? As we have other
    > course work, we might be able to devote up to 4-5 hrs to this project.
    
    If you intend to pursue an Uncoordinated Checkpoint/Restart
    Coordination Protocol in LAM/MPI, it may take a few months or even a
    few years, depending on quite a few factors. Most notable among those
    factors are familiarity with the LAM/MPI code base, familiarity with
    the Uncoordinated C/R literature, and the experience of the
    developers. You will need to become familiar with the LAM/MPI code
    base, specifically how the current Coordinated C/R Coordination
    Protocol works. In addition, Uncoordinated C/R Coordination Protocols
    can become quite complex in their reconstruction of the multiprocess
    environment upon restart (especially without using Message Logging
    techniques), which will add significantly to the time spent
    developing code.
    
    >
    > If the above project is not feasible in the specified time of 1-1.5
    > months with 2 developers working on it, please suggest something
    > that we can contribute to BLCR which would be related to avoidance
    > of rollback propagation.
    
    Rollback propagation is a concept involving multiple processes using  
    (mainly) Uncoordinated Checkpoint/Restart Protocols. Since BLCR is a  
    single process checkpoint/restart service you are really looking to  
    do something building upon it in a distributed process environment  
    (like MPI provides for example). I may be wrong, but I think what you  
    are looking for is an MPI implementation to experiment with. LAM/MPI  
    is one option, and has some of the groundwork already laid out but  
    certainly not all.
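
To make rollback propagation concrete, here is a toy Python sketch. It is not BLCR or LAM/MPI code, and the function name, the data layout, and the use of a single global clock are all simplifying assumptions. Each process checkpoints on its own schedule. On restart, a message whose receive is recorded in the receiver's checkpoint but whose send has been undone at the sender is an orphan, so the receiver must roll back to a checkpoint taken before that receive, which can in turn orphan other messages.

```python
from bisect import bisect_right


def recovery_line(checkpoints, messages):
    """Find a consistent set of checkpoints to restart from.

    checkpoints: {proc: sorted checkpoint times, starting with 0}
    messages:    list of (sender, send_time, receiver, recv_time)
    A checkpoint taken at time c records every event with time < c.
    """
    # Start optimistically from each process's latest checkpoint.
    line = {p: times[-1] for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, send_t, receiver, recv_t in messages:
            send_undone = send_t >= line[sender]
            recv_recorded = recv_t < line[receiver]
            if send_undone and recv_recorded:
                # Orphan message: roll the receiver back to its latest
                # checkpoint taken at or before the receive.
                times = checkpoints[receiver]
                line[receiver] = times[bisect_right(times, recv_t) - 1]
                changed = True
    return line


if __name__ == "__main__":
    ckpts = {0: [0, 4, 8], 1: [0, 6, 10]}
    msgs = [(0, 9, 1, 9), (1, 7, 0, 7), (0, 5, 1, 5), (1, 2, 0, 3)]
    print(recovery_line(ckpts, msgs))
```

In this hypothetical schedule, the single message sent at time 9 forces both processes all the way back to their initial state, which is the worst case of rollback propagation. Without that message, both latest checkpoints would be usable as they are.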
    
    BLCR's newest feature, the ability to checkpoint/restart process
    groups within a single machine, might be another, smaller area that
    you could look at. That would mean looking at how to
    checkpoint/restart a process group that communicates via shared
    memory using an Uncoordinated C/R Coordination Protocol or something
    like it.
    
    -- Josh
    
    >
    > Thanks.
    >
    > -- 
    > N. Rajagopal,
    > Visit me at http://users.kaski-net.net/~raj/
    
