Re: Hang in cr_restart

From: Karthik Gopalakrishnan (gopalakk_at_cse.ohio-state.edu)
Date: Thu Jan 29 2009 - 01:31:53 PST

    Hi Paul.
    
    Thanks. That confirms what I suspected. Even Ctrl+C does not work
    after restart. I think I understand your point about not calling
    do_real_work() from the CR callback, and I will restructure my
    program to avoid that. Could you please point me to a suitable
    example in BLCR's 'tests' directory?
    
    Thanks & Regards,
    Karthik
    
    On Thu, Jan 29, 2009 at 3:34 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    > I think the root of your problem is that BLCR invokes its callbacks with all
    > signals blocked.  This is preventing SIGCHLD from being delivered.  You
    > could unblock the signal yourself, but that is probably not the way to go
    > (though I can't say for sure without seeing the full application).  I think that
    > perhaps you are not using the callback as we had intended (though I admit
    > our documentation is a little "thin").  It was not our intention that the
    > "normal" flow of your application would pick up in the callback, as your call
    > to do_real_work() appears to.  Instead it would be proper for the callback
    > to raise some signal or otherwise "tell" the normal application flow (which
    > is, I believe, currently just "while(1)") to do something.
    >
    > It is probably also worth noting that the child created by fork() inherits
    > the signal mask of the parent, which in your case means the one spawned by
    > the do_real_work() call in CR_Callback() is going to run with all signals
    > blocked just as the callback does.
    >
    > Let us know if I have not been clear, or if you need more help.
    >
    > -Paul
    >
    > Karthik Gopalakrishnan wrote:
    >>
    >> Hello.
    >>
    >> I apologize for the long mail in advance. :-)
    >>
    >> I have an application which roughly works as follows:
    >>
    >> main()
    >> {
    >>     do_cr_initialization();
    >>     do_real_work();
    >> }
    >>
    >> do_real_work()
    >> {
    >>     register(SIGCHLD_Handler);
    >>     fork();
    >>     if (child) {
    >>         do_stuff();
    >>         exit(0);
    >>     }
    >>     while(1);
    >> }
    >>
    >> SIGCHLD_Handler()
    >> {
    >>     wait_for_child();
    >>     exit(0);
    >> }
    >>
    >> CR_Callback()
    >> {
    >>     if (restarting)
    >>         do_real_work();
    >> }
    >>
    >> do_stuff() is intelligent enough to continue from where it left off.
    >> Under normal execution, after do_stuff() completes and exit(0) is
    >> called, SIGCHLD_Handler() is invoked and terminates the
    >> application. However, when cr_restart is run after a checkpoint,
    >> the application just "hangs" after do_stuff() completes the remaining
    >> work and calls exit(0). SIGCHLD_Handler() is not invoked at all after
    >> the restart. The output of 'ps' shows the following:
    >>
    >> UID        PID  PPID  C STIME TTY      CMD
    >> gopalakk 11886 12020  0 20:30 pts/0    a.out
    >> gopalakk 12020 10333  0 20:30 pts/0    cr_restart context.11886
    >> gopalakk 12026 11886  0 20:30 pts/0    [a.out] <defunct>
    >>
    >> Can someone explain what's going on here?
    >>
    >> Thanks & Regards,
    >> Karthik
    >>
    >
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
    >
    
