Re: Question about "fd" token

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jun 08 2009 - 12:18:54 PDT

  • Next message: Neal Becker: "checkpoint-restart in mainline kernel"
    Allow me to answer your questions out of order, to provide the clearest
    explanation:
    
    + "how the other tasks in the req->task_list been dumped?"
    
    When a checkpoint is requested for a process, the BLCR kernel module
    sends each thread in that process an unblockable signal. The signal
    handler for that signal is in the libcr code that is either linked
    explicitly into the application, or is loaded via LD_PRELOAD. That
    signal handler ensures that every thread that has been included in the
    request will eventually call do_checkpoint(). That call will come from
    cr_checkpoint() if the thread has interacted with libcr in a way that
    caused thread-specific info to be allocated. If no thread-specific info
    has been allocated for a given thread, then the signal handler calls
    do_checkpoint() directly.
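
    To make the two paths concrete, here is a minimal sketch of that
    decision in plain C. It is not the libcr source: struct thread_info,
    sim_signal_handler(), sim_cr_checkpoint() and sim_do_checkpoint() are
    hypothetical stand-ins that only model which path a thread takes.

```c
#include <stddef.h>

/* Hypothetical stand-in for libcr's per-thread state (not BLCR's real type). */
struct thread_info {
    int callbacks_registered;
};

enum entry_path { VIA_CR_CHECKPOINT, VIA_DO_CHECKPOINT_DIRECT };

static int kernel_calls; /* counts simulated calls into the kernel module */

/* Stand-in for do_checkpoint(): the call into the kernel. */
static void sim_do_checkpoint(void)
{
    kernel_calls++;
}

/* Stand-in for cr_checkpoint(): callbacks would run here, then the kernel call. */
static void sim_cr_checkpoint(struct thread_info *ti)
{
    (void)ti; /* callbacks omitted in this sketch */
    sim_do_checkpoint();
}

/* What the signal handler decides for one thread (ti == NULL: no info allocated). */
static enum entry_path sim_signal_handler(struct thread_info *ti)
{
    if (ti != NULL) {
        sim_cr_checkpoint(ti);
        return VIA_CR_CHECKPOINT;
    }
    sim_do_checkpoint();
    return VIA_DO_CHECKPOINT_DIRECT;
}
```

    Either way the thread ends up in the (simulated) kernel call exactly
    once; only the route differs.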
    
    + "What does these callbacks exactly used to do? To provide the user of
    blcr with interface to do something relative to the specified
    application program? Or just used to do the real checkpoint stuff?"
    
    As you noticed, before calling into the kernel via do_checkpoint() to
    perform the real work of saving the checkpoint, the code in
    cr_checkpoint() runs a stack of callbacks. These callbacks are, as you
    guessed, "to provide the user of blcr with interface to do something
    relative to the specified application program." The common motivating
    example use of a callback is for a distributed application to save
    information about the state of its communication (since BLCR does not
    save socket state, or any other network info).
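
    As a rough illustration of "run the callbacks, then do the real
    work", the sketch below models a small callback stack in plain C. All
    names (sim_register_callback(), sim_checkpoint(), save_comm_state(),
    struct comm_state) are hypothetical and are not the libcr API. Running
    the most recently registered callback first, and aborting when a
    callback fails, are assumptions made for the sketch. In libcr itself a
    callback calls cr_checkpoint() to continue, rather than being driven
    by a flat loop.

```c
#include <stddef.h>

#define MAX_CALLBACKS 8

typedef int (*callback_fn)(void *arg);

struct callback {
    callback_fn fn;
    void *arg;
};

static struct callback cb_stack[MAX_CALLBACKS];
static int cb_count;
static int checkpoint_taken; /* set once the simulated "real work" has run */

/* Register a callback; returns its index, or -1 if the stack is full. */
static int sim_register_callback(callback_fn fn, void *arg)
{
    if (cb_count >= MAX_CALLBACKS)
        return -1;
    cb_stack[cb_count].fn = fn;
    cb_stack[cb_count].arg = arg;
    return cb_count++;
}

/* Run every callback (newest first), then perform the simulated checkpoint. */
static int sim_checkpoint(void)
{
    for (int i = cb_count - 1; i >= 0; i--) {
        if (cb_stack[i].fn(cb_stack[i].arg) != 0)
            return -1; /* assumption: a failing callback aborts the checkpoint */
    }
    checkpoint_taken = 1;
    return 0;
}

/* Example callback: the application records its own communication state. */
struct comm_state {
    int open_connections;
    int saved;
};

static int save_comm_state(void *arg)
{
    struct comm_state *cs = arg;
    cs->saved = cs->open_connections;
    return 0;
}
```

    The point of the example callback is that the application, not the
    checkpointer, knows what communication state must be recorded.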
    
    + Regarding my_cb() in cr_checkpoint.c:
    
    This callback in the cr_checkpoint utility program is *not* related to
    checkpointing of another process, so it may have confused or misled
    you. It is used in the cr_checkpoint utility to make sure that if
    *cr_checkpoint* itself is checkpointed, the checkpoint it requested has
    either completed or not yet started. We do that by using a
    pthread mutex to be certain that checkpointing of cr_checkpoint and
    checkpointing of the requested process(es) are mutually exclusive (if
    that is impossible because cr_checkpoint has requested a checkpoint
    that includes itself, we OMIT it from the checkpoint to avoid deadlock). This
    is an example of "do something relative to the specified application
    program". In this case the application is the cr_checkpoint utility and
    the "do something" is making sure that we get "all or nothing" from the
    checkpoint request (because the request is not restartable across the
    checkpoint).
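
    The mutual exclusion can be sketched roughly as follows. This is not
    the code from cr_checkpoint.c: request_lock, sim_request_checkpoint()
    and sim_self_checkpoint_callback() are hypothetical names, and the
    actual request and checkpoint steps are reduced to comments. Only the
    locking pattern is shown.

```c
#include <pthread.h>

/* One lock shared by the request path and the utility's own callback. */
static pthread_mutex_t request_lock = PTHREAD_MUTEX_INITIALIZER;
static int request_in_flight; /* 1 only while a requested checkpoint is running */

/* Stand-in for issuing a checkpoint request and waiting for it to finish. */
static void sim_request_checkpoint(void)
{
    pthread_mutex_lock(&request_lock);
    request_in_flight = 1;
    /* ... the request would be issued and awaited here ... */
    request_in_flight = 0;
    pthread_mutex_unlock(&request_lock);
}

/* Stand-in for the callback that runs when the utility itself is checkpointed.
 * Returns the value of request_in_flight seen while holding the lock. */
static int sim_self_checkpoint_callback(void)
{
    pthread_mutex_lock(&request_lock);
    int observed = request_in_flight; /* 0: the request path is not mid-flight */
    /* ... the utility's own checkpoint would proceed here ... */
    pthread_mutex_unlock(&request_lock);
    return observed;
}
```

    Because both paths take the same lock, the callback can only run
    before the request starts or after it has finished, which is the "all
    or nothing" property described above.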
    
    -Paul
    
    The original poster wrote:
    > Hello, Professor:
    >
    > I have read the paper "The Design and Implementation of Berkeley Lab's
    > Linux Checkpoint/Restart" several times, reading the source code in
    > between. However, I still cannot understand the user library
    > "callback" mechanism.
    >
    > What does these callbacks exactly used to do? To provide the user of
    > blcr with interface to do something relative to the specified
    > application program? Or just used to do the real checkpoint stuff?
    >
    > In checkpoint.c, I noticed that before we issue the request to build
    > the task list that we want to checkpoint, the callback "my_cb" was
    > registered:
    > /* Register our callback */
    > cb_id = cr_register_callback(&my_cb, NULL, CR_THREAD_CONTEXT);
    >
    > my_cb() invokes cr_checkpoint():
    >
    > cr_checkpoint() will not invoke do_checkpoint() to do the real dump
    > work until all the callbacks in the callbacks array, which is obtained
    > from the thread-specific info of the current thread, have run.
    >
    > so what does these callback used to do?
    >
    > By the way, the function cr_dump_self() seems to dump only the current
    > process.how the other tasks in the req->task_list been dumped?
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
