Re: Document for Friday's mtg

From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Mon Jun 03 2002 - 10:26:24 PDT


NOW MOVING THE DISCUSSION TO THE LIST.


Jeff, I think you ARE missing something - sorry for confusing you.  When 
I refer to CHECKPOINT, CONTINUE and RESTART I am referring to blocks of 
code in the following handler template:

void handler(void* arg)
{
     int rc;

     /* do CHECKPOINT work here */

     rc = cr_checkpoint();
     if (CR_IS_FAILURE(rc)) {
         /* deal with FAILURE here */
     } else if (CR_IS_RESTART(rc)) {
         /* do RESTART work here */
     } else {
         /* do CONTINUE work here */
     }
}

The cr_checkpoint() call is a return-twice call in the spirit of fork() 
or setjmp().  The first (chronologically) return is just continuing 
after the checkpoint has been taken.  The second return is when 
restarting from a checkpoint.
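
In case the two-return behavior is hard to picture, here is a standalone
setjmp()/longjmp() analogy.  This is only an illustration I am sketching
for this mail, not code from the checkpoint library:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf env;

int main(void)
{
    if (setjmp(env) == 0) {
        /* First (chronological) return -- analogous to the CONTINUE
         * path after cr_checkpoint(). */
        printf("first return\n");
        longjmp(env, 1);   /* stands in for "restart from the image" */
    } else {
        /* Second return -- analogous to the RESTART path. */
        printf("second return\n");
    }
    return 0;
}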

As for the stdin/out/err question, I am referring to the fd passing you 
mention.  The setup (mpirun passes the fds to the local lamd) must be 
repeated at restart time because we have new fds to deal with.
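
For reference, by "fd passing" I mean the usual SCM_RIGHTS handoff over a
unix domain socket.  A generic sketch of that mechanism follows -- this is
not the actual LAM/mpirun code, and send_fd() is just a made-up name:

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Pass one open file descriptor (e.g. the application's stdin) across a
 * connected AF_UNIX socket.  Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
    struct msghdr   msg;
    struct iovec    iov;
    struct cmsghdr *cmsg;
    char dummy = 'F';                       /* must send at least one byte */
    char cbuf[CMSG_SPACE(sizeof(int))];

    memset(&msg, 0, sizeof(msg));
    memset(cbuf, 0, sizeof(cbuf));

    iov.iov_base       = &dummy;
    iov.iov_len        = 1;
    msg.msg_iov        = &iov;
    msg.msg_iovlen     = 1;
    msg.msg_control    = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}

At restart time this handoff has to be repeated over the newly created
socket, which is why the setup cannot simply be carried along in the
checkpoint image.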

-Paul


Jeff Squyres wrote:

> On Mon, 3 Jun 2002, Paul H. Hargrove wrote:
> 
> 
>>The main distinction between the CONTINUE and RESTART code for the
>>mpirun process has to do with file handles.  When we CONTINUE, the
>>mpirun process is still connected to the local lamd by a unix domain
>>socket and that lamd has the proper stdin/out/err.  When we RESTART,
>>we must build a new unix domain socket and must pass the
>>stdin/out/err to the local lamd.
>>
>>In libmpi the situation is similar: all sockets are in place (unless
>>using the shutdown trick) in the CONTINUE case; no sockets are in
>>place in the RESTART case.
>>
> 
> (should we be using the checkpoint_at_lbl_dot_gov address for this thread?)
> 
> Not sure what you mean here...  Two things:
> 
> 1. What's the value of CONTINUE?
> 2. What do you mean by "the proper stdin/out/err"?
> 
> Longer explanations:
> 
> 1. The way I understand it, if you CONTINUE, you still get a bunch of
> image files as output, right?  Is the intent that these image files can be
> used later to restart the process?  e.g., for the scenario:
> 
>   Time   Description
>   ------ --------------------------------------------------------------
>   T=0    mpirun C foo
>   ...
>   T=N    foo does a checkpoint/CONTINUE
>   T=N+1  foo continues as if nothing had happened
>   ...
>   T=M    foo aborts/dies ungracefully
>   ...
>   T=P    user manually re-starts foo with the image files from the
>          checkpoint/CONTINUE at T=N
>   ------ --------------------------------------------------------------
> 
> Is that the intent?
> 
> If so, then both CONTINUE and RESTART are supposed to turn out image
> files that are suitable for re-starting the process, right?  If that's
> right, then I think that libmpi and mpi need to do exactly the same thing
> in CONTINUE and RESTART.  Particularly in terms of the MPI data
> connections (in the RPI), but also the connection to the lamd's unix
> socket -- they need to be flushed and closed before the checkpoint occurs
> and then re-opened after the checkpoint resumes (for both the CONTINUE and
> RESTART cases).
> 
> If these connections are not flushed/closed, then the image files won't be
> able to be reliably used to restart the foo process.
> 
> 2. What does lamd have to do with stdout/err/in?  The local lamd's stdout/err
> will always be tied to where lamboot was run, and its stdin is closed.
> All remote lamds' stdout/err/in are closed.
> 
> Did you mean the stdout/err/in of the user application being tied to
> mpirun?  e.g., "mpirun C foo", how the stdout/err/in is tied to the
> originating mpirun?  If so, the input/output from foo is passed *through*
> the lamd, but in a very transparent way -- the lamd only handles the
> setup, and the rest is done transparently by the OS (using file descriptor
> passing from mpirun to the lamd).
> 
> So I'm not quite clear on what you mean...
> 
> -----
> 
> One clarification from my previous mail: Brian informs me that I was
> incorrect -- nsend/nrecv do *not* invoke malloc/free anywhere in their
> call stacks.  So we should be ok there.
> 
> {+} Jeff Squyres
> {+} [email protected]
> {+} http://www.lam-mpi.org/

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998