[greg_at_bronevetsky_dot_com: Re: MPI support for BLCR]

jcduell_at_lbl_dot_gov
Date: Tue Feb 28 2006 - 17:45:26 PST

    Greg,
    
    I'm going to ponder this for a little while before answering.  I'm also
    forwarding to our mailing list, so the other BLCR developers can think
    it over, too.
    
    I understand that your software layer intercepts all calls to MPI, and
    then runs some arbitrary MPI layer underneath it.  Could you tell me
    what happens to this underlying MPI layer when you checkpoint?  Do you
    kill it off (MPI_Finalize) and then recreate it all at restart
    (MPI_Init), transparently to the user?  If not, it's unclear to me how
    you're checkpointing the application without preserving any of the MPI
    libraries' "state" (could you be clear about what "state" you're talking
    about--sockets?  List of hostname/ports where the jobs are running?  All
    stack/heap state?).
    
    
    -- 
    Jason Duell             Future Technologies Group
    <jcduell_at_lbl_dot_gov>       Computational Research Division
    Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory
    
    
    ----- Forwarded message from Greg Bronevetsky <greg_at_bronevetsky_dot_com> -----
    
    From: Greg Bronevetsky <greg_at_bronevetsky_dot_com>
    Subject: Re: MPI support for BLCR
    Date: Tue, 28 Feb 2006 20:21:15 -0500
    To: JCDuell_at_lbl_dot_gov
    
    What you're describing mostly makes sense but I still don't understand 
    how LAM state was separated from application state. Does LAM not have 
    any MPI state in the application's address space and instead keeps it in 
    a separate process?
    
    Our checkpoint coordination approach is more complex than LAM's because 
    we don't require the network to be empty but instead keep track of 
    outstanding MPI messages and record them as necessary in our checkpoint 
    (this was chosen because it is more scalable). Furthermore, we are not an 
    MPI implementation but rather a layer that runs between the application 
    and MPI, intercepting all MPI calls. As such, we can work with any 
    implementation of MPI. This of course poses some problems since if the 
    application is statically linked then the application, our layer and the 
    MPI implementation are all parts of the same process image and will all 
    get saved by a system like BLCR. This would be erroneous since there 
    would be a lot of MPI state that would be invalid on restart. Instead we 
    need a way to save just the application state and leave our layer and 
    the MPI implementation alone so that we can take care of it ourselves.
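
To illustrate the interposition idea described above, here is a minimal sketch in C. It is not the actual layer's code. A wrapper sits in front of a send call, remembers every message that has been sent but not yet received, and copies the outstanding ones out at checkpoint time. All names (`layer_send`, `layer_mark_delivered`, `layer_checkpoint`, `underlying_send`) are hypothetical, and `underlying_send` is only a stand-in for the real MPI library. A real layer would more likely hook the standard MPI profiling interface, defining `MPI_Send` and forwarding to `PMPI_Send`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_IN_FLIGHT 64
#define MAX_PAYLOAD   32

/* Hypothetical record of one message that was sent but not yet received. */
typedef struct {
    int    dest;
    int    tag;
    size_t len;
    char   payload[MAX_PAYLOAD];
    int    in_use;
} in_flight_msg;

static in_flight_msg in_flight[MAX_IN_FLIGHT];

/* Stand-in for the underlying MPI send (assumption: a real layer would
 * forward to the MPI library here instead of doing nothing). */
static int underlying_send(const void *buf, size_t len, int dest, int tag)
{
    (void)buf; (void)len; (void)dest; (void)tag;
    return 0;
}

/* Interposed send: record the message as outstanding, then forward it. */
int layer_send(const void *buf, size_t len, int dest, int tag)
{
    if (len > MAX_PAYLOAD)
        return -1;                      /* too large for this toy table */
    for (int i = 0; i < MAX_IN_FLIGHT; i++) {
        if (!in_flight[i].in_use) {
            in_flight[i].dest = dest;
            in_flight[i].tag = tag;
            in_flight[i].len = len;
            memcpy(in_flight[i].payload, buf, len);
            in_flight[i].in_use = 1;
            return underlying_send(buf, len, dest, tag);
        }
    }
    return -1;                          /* table full */
}

/* Called once the layer learns that the message to (dest, tag) arrived. */
int layer_mark_delivered(int dest, int tag)
{
    for (int i = 0; i < MAX_IN_FLIGHT; i++) {
        if (in_flight[i].in_use &&
            in_flight[i].dest == dest && in_flight[i].tag == tag) {
            in_flight[i].in_use = 0;
            return 0;
        }
    }
    return -1;                          /* no such outstanding message */
}

/* At checkpoint time, copy up to cap outstanding messages into out so
 * they can be written alongside the application state. Returns the count. */
size_t layer_checkpoint(in_flight_msg *out, size_t cap)
{
    assert(out != NULL);
    size_t n = 0;
    for (int i = 0; i < MAX_IN_FLIGHT && n < cap; i++) {
        if (in_flight[i].in_use)
            out[n++] = in_flight[i];
    }
    return n;
}
```

In this sketch the network does not have to be drained before a checkpoint. Whatever is still listed as in flight is saved with the checkpoint and could be replayed after restart.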
    
    We would be willing to modify aspects of our layer to make it more 
    compatible with BLCR but we cannot modify the underlying MPI 
    implementation since the whole point is for our system to work on any 
    MPI implementation. Would this type of checkpointing be possible with BLCR?
    
    -- 
                                Greg Bronevetsky
    
    >The LAM team used our callback notifications to shut down all TCP (or
    >other network) connections, so that when our checkpoint code ran, there
    >was no network state that needed to be saved.  They also arranged to save
    >the info they need to reconnect all the processes at startup.  Finally,
    >they also arranged it so that using our checkpoint program on their
    >'mpirun' (i.e., the user's initial program to start the parallel MPI job)
    >caused mpirun to arrange for all other processes in the MPI job to be
    >checkpointed before mpirun itself returned from the callback and was
    >checkpointed.  In sum, our code just 'sees' that a single 'mpirun'
    >process is to be checkpointed.  Mpirun's callback contains all the logic
    >that ensures each process in the parallel job is checkpointed before it
    >itself is checkpointed.  Restart works the same way--mpirun's restart
    >callback handles restarting the entire parallel job.
    >
    >Needless to say, this wasn't transparent to the MPI library--they did a
    >lot of work to handle the parallel aspects.
    >
    >It sounds like your MPI library could be made to work with BLCR if you
    >can write a callback that shuts down any TCP/IP connections (and does
    >whatever other work you normally do for a checkpoint) right before
    >checkpoint time, and then restores them at restart.  This is
    >theoretically just a matter of writing two functions--a checkpoint-time
    >callback, and a restart-time callback.  How easy that is depends on
    >whether it's easy for you to close/reopen the network state.
    >
    >Does that make sense?
    >
    > 
    >
    
    ----- End forwarded message -----
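
The two-callback pattern described in the quoted reply (close the network state just before the checkpoint, reopen it at restart) can be sketched roughly as follows. This is a self-contained simulation, not BLCR or LAM code. The types `peer` and `net_state` and the functions `net_add_peer`, `on_checkpoint`, `on_restart` and `net_open_count` are hypothetical names, and the `connected` flag stands in for a real socket. How the callbacks would actually be registered with BLCR is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PEERS 8

/* One remote process; `connected` simulates an open socket. */
typedef struct {
    char host[64];
    int  port;
    int  connected;
} peer;

typedef struct {
    peer   peers[MAX_PEERS];
    size_t npeers;
} net_state;

/* Register a peer and mark its (simulated) connection as open. */
int net_add_peer(net_state *ns, const char *host, int port)
{
    assert(ns != NULL && host != NULL);
    if (ns->npeers >= MAX_PEERS)
        return -1;
    peer *p = &ns->peers[ns->npeers++];
    strncpy(p->host, host, sizeof p->host - 1);
    p->host[sizeof p->host - 1] = '\0';
    p->port = port;
    p->connected = 1;
    return 0;
}

/* Checkpoint-time callback: close every connection but keep host/port,
 * so that no live network state ends up in the saved process image.
 * Returns the number of connections closed. */
size_t on_checkpoint(net_state *ns)
{
    size_t closed = 0;
    for (size_t i = 0; i < ns->npeers; i++) {
        if (ns->peers[i].connected) {
            ns->peers[i].connected = 0;   /* real code would close() here */
            closed++;
        }
    }
    return closed;
}

/* Restart-time callback: reconnect to every peer from the saved
 * host/port list. Returns the number of connections reopened. */
size_t on_restart(net_state *ns)
{
    size_t reopened = 0;
    for (size_t i = 0; i < ns->npeers; i++) {
        if (!ns->peers[i].connected) {
            ns->peers[i].connected = 1;   /* real code would connect() here */
            reopened++;
        }
    }
    return reopened;
}

/* Count currently open (simulated) connections. */
size_t net_open_count(const net_state *ns)
{
    size_t n = 0;
    for (size_t i = 0; i < ns->npeers; i++)
        n += ns->peers[i].connected ? 1 : 0;
    return n;
}
```

The point of the sketch is that only the reconnect information (host and port) survives the checkpoint. How hard the real equivalent is depends, as the reply says, on how easily the library can close and reopen its network state.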
    
