Summary of 03JUN2002 C/R conference

From: Paul H. Hargrove (PHHargrove_at_lbl.gov)
Date: Mon Jun 03 2002 - 20:42:59 PDT


Summary of AG conference on 03 JUN 2002

In attendance:
    Jason Duell                  LBNL
    Brent Gorda                  LBNL
    Paul Hargrove                LBNL
    Eric Roman                   LBNL
    Andy Lumsdaine               IU
    Sriram Sankaran              IU, on loan to LBNL for Summer
    Jeff Squyres (telephone)     LAM/MPI


One of the main goals of this conference was to make sure we could get
the Access Grid working for us, plus a phone-in.  It worked, after a
flurry of e-mail between the operators to get a virtual venue
selected.

The main content of the call was devoted to talking about how we might
let a checkpoint-aware library do its work outside of signal context.
The current design admits multiple checkpoint-aware libraries by
invoking the handlers in the reverse of their order of registration,
like atexit().  This means there are no issues of deadlock due to
strange interactions between the libraries.  However, there are a lot
of problems related to non-reentrant code.  All of libc is thread
safe, but much of it is not reentrant.  Among the things that are not
reentrant is malloc().
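
As a rough illustration only (this is not the actual implementation),
the reverse-order invocation could be sketched in C as below.  The
names cr_register(), cr_run_handlers() and log_call() are made up for
this sketch, as is the handler signature.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HANDLERS 16

/* Hypothetical handler signature, assumed for this sketch. */
typedef void (*cr_handler_fn)(void *arg);

struct cr_handler {
    cr_handler_fn fn;
    void *arg;
};

static struct cr_handler handlers[MAX_HANDLERS];
static size_t handler_count = 0;

/* Register a handler.  Returns 0 on success, -1 if the table is full. */
int cr_register(cr_handler_fn fn, void *arg)
{
    assert(fn != NULL);
    if (handler_count == MAX_HANDLERS)
        return -1;
    handlers[handler_count].fn = fn;
    handlers[handler_count].arg = arg;
    handler_count++;
    return 0;
}

/* Invoke handlers in the reverse of their registration order,
 * the same ordering atexit() uses for its functions. */
void cr_run_handlers(void)
{
    for (size_t i = handler_count; i > 0; i--)
        handlers[i - 1].fn(handlers[i - 1].arg);
}

/* Example handler: records the first character of its argument so
 * the order of the calls can be inspected afterwards. */
static char call_log[MAX_HANDLERS + 1];
static size_t log_len = 0;

static void log_call(void *arg)
{
    if (log_len < MAX_HANDLERS) {
        call_log[log_len++] = *(const char *)arg;
        call_log[log_len] = '\0';
    }
}
```

Registering log_call three times with "A", "B" and "C" and then
calling cr_run_handlers() leaves "CBA" in call_log: the library that
registered last is asked to checkpoint first.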

Paul and Jason had a preliminary discussion before the call of how we
might let a handler "acknowledge" a checkpoint request, but indicate
that it would call back later to complete the actual checkpoint.  This
would allow the actual work to be done outside signal context.
Additionally, the signal handler would return so the application would
resume execution for a while.  This is important, for instance, if the
signal handler had run while the application was holding the malloc
mutex.  We are calling this mode of operation "asynchronous" - to mean
that quiescing the network can proceed (and complete) independent of
when the handler (in signal context) returns.
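
The general shape of "acknowledge now, do the work later" can be
sketched with plain ISO C signals.  This is only a minimal sketch of
the pattern, not the interface we discussed: the names cr_install(),
cr_poll(), cr_pending and cr_completed are invented here, and polling
from the application is just one assumed way of getting back out of
signal context.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>

/* Set in signal context, consumed in normal context. */
static volatile sig_atomic_t cr_pending = 0;

/* Number of checkpoints completed outside signal context. */
static int cr_completed = 0;

/* The signal handler only acknowledges the request.  It calls
 * nothing non-reentrant (no malloc(), no stdio) and returns at once,
 * so the application resumes and can release any lock it held. */
static void cr_signal_handler(int signo)
{
    (void)signo;
    cr_pending = 1;
}

/* Install the handler.  Returns 0 on success, -1 on failure.
 * signal() keeps the sketch within ISO C; real code would more
 * likely use sigaction() for well-defined semantics. */
int cr_install(int signo)
{
    assert(signo > 0);
    return signal(signo, cr_signal_handler) == SIG_ERR ? -1 : 0;
}

/* Called by the library from normal (non-signal) context.
 * Returns 1 if a checkpoint was performed, 0 if none was pending,
 * and -1 on allocation failure. */
int cr_poll(void)
{
    if (!cr_pending)
        return 0;
    cr_pending = 0;

    /* Outside signal context malloc() is safe to call. */
    void *scratch = malloc(64);
    if (scratch == NULL)
        return -1;
    /* ... quiesce the network and write state here ... */
    free(scratch);

    cr_completed++;
    return 1;
}
```

After cr_install(SIGTERM), a raise(SIGTERM) only sets cr_pending; the
checkpoint itself happens at the next cr_poll() call.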

On this call we discussed in broad strokes the idea of the
asynchronous checkpoint and the fact that the major hurdle is avoiding
deadlock in the case of two or more checkpoint-aware libraries.  With
proper documentation we think just a well-defined set of rules and
liberal use of "checkpoint critical sections" will PROBABLY get us
clear of this.
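
One possible reading of a "checkpoint critical section", sketched
under the assumption that it simply defers a checkpoint request until
the outermost section is left.  The semantics were not pinned down on
the call, so every name here (cr_enter_cs(), cr_leave_cs(),
cr_request(), take_checkpoint()) is hypothetical, and the sketch is
single-threaded.

```c
#include <assert.h>

static int cs_depth = 0;          /* nesting depth of critical sections */
static int cs_deferred = 0;       /* request arrived while inside one   */
static int checkpoints_taken = 0; /* stand-in for the real work         */

/* Placeholder for the actual checkpoint. */
static void take_checkpoint(void)
{
    checkpoints_taken++;
}

/* Enter a region in which no checkpoint may be taken. */
void cr_enter_cs(void)
{
    cs_depth++;
}

/* Leave the region.  If a request was deferred and this was the
 * outermost section, the checkpoint is taken now. */
void cr_leave_cs(void)
{
    assert(cs_depth > 0);
    cs_depth--;
    if (cs_depth == 0 && cs_deferred) {
        cs_deferred = 0;
        take_checkpoint();
    }
}

/* A checkpoint request: taken immediately when outside any critical
 * section, otherwise remembered until the section ends. */
void cr_request(void)
{
    if (cs_depth > 0)
        cs_deferred = 1;
    else
        take_checkpoint();
}
```

A library would wrap any region that must not be interrupted by a
checkpoint (for example while it holds one of its own locks) in
cr_enter_cs() and cr_leave_cs().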

It was resolved to just pretend for the moment that we only need one
checkpoint-aware library and postpone detailed work on the deadlock
problem.  Instead, Paul will try to get something done ASAP which lets
Sriram develop the checkpoint handler for LAM/MPI outside of signal
context.

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
NERSC Future Technologies Group           Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-495-2998