Re: MPI support for BLCR

jcduell_at_lbl_dot_gov
Date: Tue Feb 28 2006 - 14:05:02 PST

    On Tue, Feb 28, 2006 at 03:09:11PM -0500, Greg Bronevetsky wrote:
    > I am a grad student at Cornell, working on checkpointing of MPI 
    > applications. Our checkpointer works with any implementation of MPI and 
    > (in principle) with any single-process checkpointer. However, in 
    > practice integration with single process checkpointers is made more 
    > complex because by default such a checkpointer will save the state of 
    > the entire process, including MPI state. This is generally incorrect as 
    > MPI state contains hardware information that will not be valid on restart.
    > 
    > I know that you've integrated BLCR with LAM, presumably in a way that 
    > doesn't save LAM's state but instead lets LAM save its own state. How 
    > did you do this? Was it via a special API (the callbacks referred to in 
    > your FAQ) or did you use a more general technique?
    
    The LAM team used our callback notifications to shut down all TCP (or
    other network) connections, so that when our checkpoint code ran, there
    was no network state that needed to be saved.  They also arranged to save
    the info they need to reconnect all the processes at startup.  Finally,
    they arranged it so that using our checkpoint program on their
    'mpirun' (i.e., the user's initial program to start the parallel MPI job)
    caused mpirun to arrange for all other processes in the MPI job to be
    checkpointed before mpirun itself returned from the callback and was
    checkpointed.  In sum, our code just 'sees' that a single 'mpirun'
    process is to be checkpointed.  Mpirun's callback contains all the logic
    that ensures each process in the parallel job is checkpointed before
    mpirun is checkpointed.  Restart works the same way--mpirun's restart
    callback handles restarting the entire parallel job.
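
    To make the ordering concrete, here is a rough sketch of the idea in
    plain C.  It is only a simulation: every name in it (struct worker,
    checkpoint_worker, launcher_checkpoint_callback, checkpoint_job) is
    hypothetical and is not LAM's or BLCR's actual code or API.  It just
    shows a launcher-side callback that does not return until every other
    process has been checkpointed.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_WORKERS 3   /* hypothetical job size, for illustration only */

/* Hypothetical per-process record; not a LAM or BLCR type. */
struct worker {
    int rank;
    int checkpointed;   /* set to 1 once this worker has been checkpointed */
};

/* Stand-in for "ask one remote process to checkpoint itself". */
static int checkpoint_worker(struct worker *w)
{
    w->checkpointed = 1;
    printf("worker %d checkpointed\n", w->rank);
    return 0;
}

/* Hypothetical launcher-side callback: it returns 0 only after every
 * worker has been checkpointed, so the launcher is saved last. */
static int launcher_checkpoint_callback(struct worker *workers, int n)
{
    for (int i = 0; i < n; i++) {
        if (checkpoint_worker(&workers[i]) != 0)
            return -1;  /* abort the whole checkpoint on any failure */
    }
    return 0;
}

/* Simulate one checkpoint of the whole job.  Returns the number of
 * workers checkpointed before the launcher, or -1 on failure. */
static int checkpoint_job(void)
{
    struct worker workers[NUM_WORKERS];
    int done = 0;

    for (int i = 0; i < NUM_WORKERS; i++) {
        workers[i].rank = i;
        workers[i].checkpointed = 0;
    }

    if (launcher_checkpoint_callback(workers, NUM_WORKERS) != 0)
        return -1;

    for (int i = 0; i < NUM_WORKERS; i++)
        done += workers[i].checkpointed;

    /* Only now would the launcher's own process image be saved. */
    assert(done == NUM_WORKERS);
    printf("launcher checkpointed last\n");
    return done;
}
```

    Calling checkpoint_job() from a main() walks through the sequence.
    In a real job the per-worker step would be a remote request rather
    than a local function call.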
    
    Needless to say, this wasn't transparent to the MPI library--they did a
    lot of work to handle the parallel aspects.
    
    It sounds like your MPI library could be made to work with BLCR if you
    can write a callback that shuts down any TCP/IP connections (and does
    whatever other work you normally do for a checkpoint) right before
    checkpoint time, and then restores them at restart.  This is
    theoretically just a matter of writing two functions--a checkpoint-time
    callback, and a restart-time callback.  How easy that is depends on
    whether it's easy for you to close/reopen the network state.
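
    As a minimal sketch of what those two functions might look like,
    here is a self-contained simulation in plain C.  All the names
    (struct net_state, net_close, net_reopen, on_checkpoint, on_restart)
    and the peer address are made up for illustration.  They are not
    BLCR's API, and how the callbacks get registered with the
    checkpointer is deliberately left out.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical network state; a real library would hold sockets here. */
struct net_state {
    int  connected;   /* 1 while the connection is open */
    char peer[64];    /* address kept so the link can be re-established */
};

/* Stand-in for closing the real connection. */
static void net_close(struct net_state *ns)
{
    ns->connected = 0;
}

/* Stand-in for reconnecting, using the saved peer address. */
static void net_reopen(struct net_state *ns)
{
    assert(ns->peer[0] != '\0');  /* reconnect info must have survived */
    ns->connected = 1;
}

/* Checkpoint-time callback: leave no live network state to be saved. */
static void on_checkpoint(struct net_state *ns)
{
    net_close(ns);
    printf("checkpoint: closed connection to %s\n", ns->peer);
}

/* Restart-time callback: rebuild the network state from saved info. */
static void on_restart(struct net_state *ns)
{
    net_reopen(ns);
    printf("restart: reconnected to %s\n", ns->peer);
}

/* Simulate one checkpoint/restart cycle.  Returns 1 if the connection
 * was closed at checkpoint time and is open again after restart. */
static int simulate_checkpoint_restart(void)
{
    struct net_state ns = { 1, "node01:5000" };  /* made-up peer */
    int closed_at_checkpoint;

    on_checkpoint(&ns);
    closed_at_checkpoint = (ns.connected == 0);

    /* The process image would be written here, with no open sockets. */

    on_restart(&ns);
    return closed_at_checkpoint && ns.connected;
}
```

    The point is that only the reconnect information (here, the peer
    string) lives in the saved process image.  The connection itself is
    torn down before the checkpoint and rebuilt afterwards.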
    
    Does that make sense?
    
    -- 
    Jason Duell             Future Technologies Group
    <jcduell_at_lbl_dot_gov>       Computational Research Division
    Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory
    
