Re: Bug Report: Stale process after abortion of restart process

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Feb 24 2005 - 11:29:15 PST

  • Next message: Michael Klemm: "Re: Bug Report: Stale process after abortion of restart process"
    Michael,
       Thanks for the bug report.  Since you report that the machine 
    requires a reset, I am certain the process is stuck in the kernel and 
    your example greatly narrows down where the problem lies.  When 
    performing a checkpoint or a restart, we keep a count of the number of 
    threads in the process and how many of them have responded to the 
    checkpoint to ensure they are all idle when we start writing the 
    checkpoint file.
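
    As a rough illustration of that bookkeeping, here is a minimal
    user-space sketch.  It is not the BLCR kernel code; the struct and
    function names are hypothetical and exist only to show why the two
    counts must match before the checkpoint file can be written.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the per-request counters described above. */
struct cr_count_model {
    int total_threads;  /* threads in the process when the request began */
    int responded;      /* threads that have reported they are idle */
};

static void model_init(struct cr_count_model *r, int total_threads)
{
    assert(total_threads > 0);
    r->total_threads = total_threads;
    r->responded = 0;
}

/* Called once by each thread when it reaches the synchronization point. */
static void model_thread_responded(struct cr_count_model *r)
{
    assert(r->responded < r->total_threads);
    r->responded++;
}

/* Writing may begin only when every counted thread has responded. */
static bool model_ready(const struct cr_count_model *r)
{
    return r->responded == r->total_threads;
}
```

    With three threads counted and only two responses, model_ready()
    stays false indefinitely, which is the shape of the hang discussed
    here.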
       Because one thread (the one running the aborting callback) has 
    exited, the counts will never be equal.  We deal with this possibility 
    at checkpoint time by having a "watchdog" task that wakes once per 
    minute to look for tasks that are part of a checkpoint but have exited.
    If any are found then we adjust the thread counts.  We don't currently 
    do this for a restart, but we probably should.  I am uncertain about why 
    this worked in a 2.4 kernel but not a 2.6.  This is the first thing I 
    will look into.
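
    The watchdog adjustment can be sketched in the same spirit.  Again
    this is only a user-space model under assumed names (the enum, the
    struct, and watchdog_scan() are all hypothetical), not the code in
    the kernel module: a periodic scan drops exited tasks from the
    expected count so that the remaining responses can complete the
    request.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 16

/* TASK_EXITED means "exited without responding"; TASK_DROPPED marks a
 * task the watchdog has already removed from the expected count. */
enum task_state { TASK_PENDING, TASK_RESPONDED, TASK_EXITED, TASK_DROPPED };

struct cr_watch_model {
    enum task_state task[MAX_TASKS];
    int ntasks;     /* entries of task[] in use */
    int expected;   /* threads we are still waiting to hear from, in total */
    int responded;  /* threads that have responded so far */
};

/* One watchdog pass: stop counting tasks that exited without responding.
 * Returns how many tasks were dropped in this pass. */
static int watchdog_scan(struct cr_watch_model *r)
{
    int dropped = 0;
    for (int i = 0; i < r->ntasks; i++) {
        if (r->task[i] == TASK_EXITED) {
            assert(r->expected > 0);
            r->task[i] = TASK_DROPPED;
            r->expected--;
            dropped++;
        }
    }
    return dropped;
}

static bool request_ready(const struct cr_watch_model *r)
{
    return r->responded == r->expected;
}
```

    Marking a task as dropped keeps a second pass from decrementing the
    count twice for the same exited task.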
       The one thing that confuses me is the fact that some uninterruptible 
    process was consuming 100% of CPU.  There should be no spin waits in the 
    kernel's checkpoint or restart code paths, so this may be an indication 
    of a larger problem.  There is a spin wait for thread synchronization in 
    user space that you may have encountered.  At that point the process 
    probably has all signals blocked, making it *almost* uninterruptible. 
    Could you please determine if sending SIGKILL (an unblockable signal) is 
    capable of killing the CPU-consuming process?  That would narrow down 
    some things for me.
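
    To show why SIGKILL is the right probe, here is a small sketch using
    only standard POSIX calls (the two helper names are mine).  It asks
    to block every signal and then inspects the resulting mask.  On
    Linux the request succeeds, but SIGKILL and SIGSTOP are silently
    left out of the mask.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>   /* for the checks in the usage note below */
#include <signal.h>
#include <stddef.h>

/* Ask to block every signal.  Returns 0 on success, -1 on error.
 * A multi-threaded program would call pthread_sigmask() instead. */
static int block_all_signals(void)
{
    sigset_t all;
    if (sigfillset(&all) != 0)
        return -1;
    return sigprocmask(SIG_SETMASK, &all, NULL);
}

/* Returns 1 if signo is currently blocked, 0 if not, -1 on error. */
static int signal_is_blocked(int signo)
{
    sigset_t cur;
    /* A NULL new-set pointer only queries the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) != 0)
        return -1;
    return sigismember(&cur, signo);
}
```

    After block_all_signals(), signal_is_blocked(SIGTERM) returns 1
    while signal_is_blocked(SIGKILL) returns 0.  So if the process
    survives "kill -KILL", it is most likely stuck in the kernel rather
    than spinning in user space.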
    
    Michael Klemm wrote:
    > -----BEGIN PGP SIGNED MESSAGE-----
    > Hash: SHA1
    > 
    > Hi,
    > 
    > playing around I found the following bug:
    > 
    > | DONE
    > |  FINISHED chkpt_callback
    > |            entering level 6 (return address 0x80486f7)
    > |               entering level 7 (return address 0x80486f7)
    > |   cr_core.c:467 cr_checkpoint: Callback 0 returned 1 - ABORTING
    > |
    > |               entering level 8 (return address 0x80486f7)
    > 
    > After printing the error message, the process freezes and continuously
    > wastes 100% CPU and is uninterruptable. Also, I'm not able to shutdown
    > the machine. Instead, I'm forced to press the machine's reset button.
    > 
    > The machine is a P4 3.06 HT DUAL running SuSE Linux, kernel version
    > 2.6.5-7.145-smp. I checked the same program on my other Linux box
    > running kernel 2.4.29. On this machine, BLCR works fine although it also
    > reports an aborted restart process (that's the correct behavior).
    > 
    > Regards
    >     -michael
    > 
    > - --
    > Computer Science Department 2, University of Erlangen-Nuremberg
    > Martensstrasse 3, D-91058 Erlangen, Germany
    > phone: ++49 (0)9131 85-28995, fax: ++49 (0)9131 85-28809
    > web: http://www2.informatik.uni-erlangen.de/~klemm
    > -----BEGIN PGP SIGNATURE-----
    > Version: GnuPG v1.2.4 (GNU/Linux)
    > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
    > 
    > iD8DBQFCHcy9WEu1syWqdn0RAq78AKCwZBky/zTtkJjjp4sFO1V3k6jNFQCffIiP
    > SFsXyOKmAAQ0f0XTfB4O3hE=
    > =i3zw
    > -----END PGP SIGNATURE-----
    > 
    > 
    > ------------------------------------------------------------------------
    > 
    > #include <stdio.h>
    > #include <string.h>
    > #include <sys/types.h>
    > #include <sys/stat.h>
    > #include <fcntl.h>
    > #include <unistd.h>
    > #include "libcr.h"
    > 
    > int recursive(int level) {
    >     int i;
    >     int result;
    >     for(i = 0; i < level * 2; i++) {
    > 	fprintf(stderr, " ");
    >     }
    >     fprintf(stderr, "entering level %d (return address %p)\n", 
    > 	    level, __builtin_return_address(0));
    >     
    >     if (level == 10) {
    > 	result = 1;
    >     }
    >     else {
    > 	if (level == 5) {
    > 	    fprintf(stderr, "WAITING FOR USER TO CHECKPOINT...\n");
    > 	    sleep(60);
    > 	    fprintf(stderr, "DONE\n");
    > 	}
    > 	result = recursive(level+1) + level;
    >     }
    > 
    >     for(i = 0; i < level * 2; i++) {
    > 	fprintf(stderr, " ");
    >     }
    >     fprintf(stderr, "leaving level  %d\n", level);
    > 
    >     return result;
    > }
    > 
    > #if 0
    > int chkpt_callback(void *arg) {
    >     fprintf(stderr, "BLCR CALLED %s(%p)\n", __FUNCTION__, arg);
    >     int result = cr_checkpoint(CR_CHECKPOINT_READY);
    >     fprintf(stderr, "FINISHED %s\n", __FUNCTION__);
    >     if(result > 0)
    > 	return 0;
    >     return result;
    > }
    > #endif
    > 
    > int chkpt_callback(void *arg) {
    >     fprintf(stderr, "BLCR CALLED %s(%p)\n", __FUNCTION__, arg);
    >     int result = cr_checkpoint(CR_CHECKPOINT_READY);
    >     fprintf(stderr, "FINISHED %s\n", __FUNCTION__);
    >     return result;
    > }
    > 
    > int main(int argc, char **argv) {
    >     cr_init();
    >     cr_register_callback(chkpt_callback, NULL, CR_THREAD_CONTEXT);
    > 
    >     fprintf(stderr, "MY PID IS %d\n", getpid());
    > 
    >     fprintf(stderr, "\n\nresult: %d\n", recursive(0));
    > 
    >     return 0;
    > }
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
