Comments on paper

jcduell_at_lbl.gov
Date: Fri Feb 28 2003 - 11:52:52 PST


Sriram:

I just read the paper, and overall it's looking good.  Here are some
comments:

In the related work section, it was a bit unclear how CoCheck
works--I assume there needs to be an instance of the special checkpoint
process on each node (if not, how does the single process save the state
of processes on other nodes?).

In the issues section, you should qualify your remark that an MPI
checkpoint/restart "must" be transparent to users--tack on a phrase like
"to gain wide acceptance, ...".

I'd title Section 4 "Implementation of Checkpoint/Restart for LAM/MPI",
since you're not describing an implementation for any old MPI.

Section 4.1.  You should add a little bit of info about job startup
under LAM, i.e., that the lamd's are spawned separately and exist beyond
the lifespan of any particular MPI application (and are thus not
logically part of the MPI application: this will help when you explain
that they aren't part of the checkpoint).  And it would be good to
mention that one launches an MPI app from a single node with 'mpirun',
which causes the lamd's to spawn off the MPI processes and set up
their connections, etc.

It's also a little confusing in the initial text (though it becomes
clearer through the diagrams and later discussion) when you mention that
the LAM layer provides "a complete message passing service", then state
in the next sentence that the MPI layer does communication, too.
Putting in a parenthetical remark that the LAM msg service is for
out-of-band communication, and/or is slower than the MPI-provided
communication, would make this clearer.

Section 4.2:  As you know, we're planning to add to BLCR the ability to
checkpoint entire process groups and sessions.  So describing it as
"the BLCR single-process checkpointer" muddies the waters.  Why don't we
fudge the issue by describing it as a system that can "checkpoint
applications on a single node", which is vague enough to be true both
now and when we add support for sessions.
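
If it would help the exposition in 4.2, you could also show roughly what
the libcr interface looks like to an application.  Something like the
sketch below (I'm going from memory here, so double-check the names and
flags against the BLCR headers before quoting anything):

    /* Rough sketch of registering a threaded BLCR checkpoint callback.
     * Names and flags are from memory -- verify against libcr.h. */
    #include "libcr.h"

    static int checkpoint_callback(void *arg)
    {
        (void) arg;                  /* unused in this sketch */

        /* ... quiesce the application: drain the network, etc. ... */

        int rc = cr_checkpoint(0);   /* the checkpoint is taken here */
        if (rc > 0) {
            /* restarting from a context file: rebuild connections */
        } else if (rc == 0) {
            /* checkpoint taken, job continuing: resume communication */
        }
        return 0;
    }

    int main(void)
    {
        cr_init();                   /* attach this process to BLCR */
        cr_register_callback(checkpoint_callback, NULL, CR_THREAD_CONTEXT);
        /* ... normal MPI/application work ... */
        return 0;
    }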

Section 4.3.  I'm not sure if saying that the three components involved
are mpirun, "the MPI library", and the TCP RPI is the way to go.  For
one thing, I was under the assumption that the RPI is part of the MPI
library.  You might be better off organizing the section headings via
the timeline of what goes on during a checkpoint (which is basically how
the text flows, anyway).  I.e., "user checkpoints mpirun", "lamd's
propagate checkpoint to MPI apps", "callback thread quiesces
application", etc.

4.3.1. I'd say "by invoking the BLCR utility cr_checkpoint with the
process id of mpirun".  This will make it clearer what cr_checkpoint
is.  
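
If you want to make it completely concrete, you could even show the
invocation.  The sketch below is mine, just to illustrate that
cr_checkpoint is an external command that takes a pid and writes a
context file -- from a shell it's a one-line command, and in C it is
nothing more than spawning that command on mpirun's pid:

    /* Illustration only: "checkpoint the job" means running the BLCR
     * cr_checkpoint utility on mpirun's pid. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int checkpoint_mpirun(pid_t mpirun_pid)
    {
        char pidstr[32];
        snprintf(pidstr, sizeof pidstr, "%d", (int) mpirun_pid);

        pid_t child = fork();
        if (child == 0) {
            execlp("cr_checkpoint", "cr_checkpoint", pidstr, (char *) NULL);
            _exit(127);                      /* exec failed */
        }
        int status = 0;
        waitpid(child, &status, 0);
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid-of-mpirun>\n", argv[0]);
            return 1;
        }
        return checkpoint_mpirun((pid_t) atoi(argv[1]));
    }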

When you say that restart results in the MPI application being restarted
with the "same process topology", I assume you don't mean that they need
to be restarted on the same nodes as before.  Do they need to be
restarted with the same distribution across nodes?  I.e., if you
checkpointed some jobs that were running on single-processor nodes,
would you be able to restart them on SMPs and use all the processors?
Assuming you currently do shared memory optimizations on SMP nodes,
would these work at restart, or would the MPI processes have been
hardcoded at initialization to think that they'd need to use TCP to
communicate with processes that weren't on their node at startup?

You don't mention anywhere in section 4.3 that the lamds don't
themselves get checkpointed (although they participate in the checkpoint
by propagating the cr_checkpoint requests, and probably some other
things, too).  You need to mention this.

4.3.3  You state earlier in the paper (in the Issues section) that
there are two strategies for handling the network in a
checkpoint--either to "checkpoint the network" (what does that mean?),
or drain all the data.  You should make it clear right at the beginning
of this section which approach LAM is using--right now you just say
'quiescing' the network, which doesn't make it clear.

Did you mean to say "blocking" when you write that "pending data can be
read from the sockets in a non-blocking fashion because it is known that
there are still outstanding bytes to be received."  You can *always*
read safely (i.e. w/o deadlock) with a non-blocking call--I would think
the guarantee that there are bits still coming would apply more to a
blocking call.
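
To make the distinction concrete: what the bookmark exchange buys you is
that the drain loop can sit in a *blocking* read until the promised
bytes arrive.  Something like this (my own sketch, not code from the
paper):

    /* Drain a socket of exactly `expected` outstanding bytes, where
     * `expected` comes from the peer's bookmark (bytes it sent minus
     * bytes we've already received).  The blocking read() is safe
     * precisely because the bookmark guarantees the data is coming. */
    #include <errno.h>
    #include <unistd.h>

    int drain_socket(int sock, size_t expected, char *buf)
    {
        size_t got = 0;
        while (got < expected) {
            ssize_t n = read(sock, buf + got, expected - got); /* blocks */
            if (n > 0)
                got += (size_t) n;
            else if (n < 0 && errno == EINTR)
                continue;                /* interrupted by a signal */
            else
                return -1;               /* EOF or error mid-drain */
        }
        return 0;
    }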

You mention that if a peer operation has not yet been posted, the
callback thread needs to signal the main thread to knock it out of its
blocking call.  I assume that at restart the MPI call is automatically
restarted without the user knowing about the interruption?  Mention
this.
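
It might even be worth spelling out the mechanism, since it's the
standard trick: the callback thread signals the main thread, the
blocking call returns EINTR, and the library quietly reissues it after
the checkpoint so the user never sees the interruption.  Roughly (my
sketch, with my own names and signal choice -- not LAM's actual code):

    /* Interrupting and transparently restarting a blocking call. */
    #include <errno.h>
    #include <pthread.h>
    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static void wakeup_handler(int sig) { (void) sig; /* just unblock */ }

    /* Install once at init, *without* SA_RESTART, so that blocked
     * system calls return EINTR instead of being silently resumed. */
    void install_wakeup_handler(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = wakeup_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGUSR1, &sa, NULL);
    }

    /* Callback thread: knock the main thread out of its blocking call. */
    void interrupt_main_thread(pthread_t main_thread)
    {
        pthread_kill(main_thread, SIGUSR1);
    }

    /* Main thread: the blocking read is retried after EINTR, so the
     * MPI user never notices that a checkpoint happened in between. */
    ssize_t read_restarting(int fd, void *buf, size_t len)
    {
        ssize_t n;
        do {
            n = read(fd, buf, len);
        } while (n < 0 && errno == EINTR);
        return n;
    }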

You never mention in section 4 (or anywhere else in the paper) what
happens to the checkpoint files.  Does LAM keep them in some well-known
place?  Or is it up to the user/batch system to manage them?  Does the
user need to do anything besides call cr_restart on the mpirun context
file?

Section 5.  It would be good to state exactly what the causes of the
checkpoint overheads are--is it just the byte counting? 
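
If it is just the byte counting, it might be worth pointing out how
cheap that is -- presumably nothing more than bumping a per-peer counter
on every send and receive so the bookmarks can be computed at checkpoint
time.  Something like (struct and field names are mine, not LAM's):

    /* Presumed source of the steady-state overhead: per-peer byte
     * counters updated on every send/recv, later used as bookmarks. */
    #include <stddef.h>

    struct peer_conn {
        int    sock;
        size_t bytes_sent;        /* forms our outgoing bookmark */
        size_t bytes_received;    /* compared against the peer's bookmark */
    };

    static inline void account_send(struct peer_conn *p, size_t n)
    {
        p->bytes_sent += n;
    }

    static inline void account_recv(struct peer_conn *p, size_t n)
    {
        p->bytes_received += n;
    }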

To address the issue of how well the system will scale to arbitrary
numbers of nodes, you might want to lay out the Big O costs of all the
stages of the checkpoint/restart.  Right now I'm guessing you've got

    1) O(log N) propagation of cr_checkpoints via lamds to all MPI
       processes.
    2) Arbitrary (?) delay while you wait for any pending MPI calls to
       finish and relinquish the lock to the checkpoint callback
    3) O(log N) for sending bookmarks between all the processes
    4) Whoppingly huge but still O(1) time to write the checkpoint
       files, assuming they go to a local disk (or at least aren't all
       going to the same NFS filesystem).  This time will of course
       increase linearly with the memory and file size of the MPI
       processes.
    5) Shutdown cost?  Do the lamds do anything?

If you break down the costs, you can probably show that the file I/O
time dominates, and you could project the O(log N) network costs out to
show how many processes a LAM MPI job would need to have in order for
the network time to dominate.  Or maybe that's silly, since it might all
depend on how much data the application has on the wire (or in "fast
track" calls).  But it would be nice to show that the parallel portion
of CR is not actually the dominant cost, and that it should scale well. 
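
If you want to write the projection down compactly, something like the
following would do (the constants and symbols are placeholders of mine,
to be filled in from your measurements):

    T_{ckpt}(N) \approx \alpha \log N
                        + \beta D_{wire}
                        + \gamma \, S_{ckpt} / B_{disk}

where D_wire is the data in flight that has to be drained, S_ckpt is the
per-process checkpoint size, and B_disk is the local disk bandwidth
(plus whatever you wait for pending MPI calls to relinquish the lock,
which is application-dependent).  The O(log N) term only dominates once
alpha*log N exceeds gamma*S_ckpt/B_disk, which you can solve for N once
the constants are measured.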

You should mention how many processes were used for the numbers in
Figures 3 and 4.  And do crunch the numbers for varying numbers of
nodes, as you clearly plan to do.

References:  I've been lazy and haven't yet updated the text of the
BLCR citation: the authors should be Paul, Eric, and myself, in
alphabetical order.

Oh, and you should use LaTeX instead of MS Turd.  It should only take a
couple of minutes to convert it...

Cheers,

-- 
Jason Duell             Future Technologies Group
<jcduell_at_lbl_dot_gov>       High Performance Computing Research Dept.
Tel: +1-510-495-2354    Lawrence Berkeley National Laboratory