Re: thaw_threads returned error

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Sep 01 2005 - 11:32:09 PDT

    There are multiple things going on here.  See below.
    
    Adolfo J. Banchio wrote:
    > Hi again,
    > 
    > I've got modules loaded and I'm testing the BLCR for
    > f90 codes compiled by Intel Fortran F90. I can run
    > and "cr_checkpoint --term" the code (sometimes it does
    > not really kills the job), 
    
    I am fairly certain that we do send SIGTERM to the process.
    However, that is all we do, and the process is free to ignore the signal.
    You could use '--signal 9' to send an unignorable kill signal, which
    would not allow the application to perform any cleanup (though sometimes
    the cleanup is unwanted, since it could delete files needed for the restart).
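
To illustrate the difference between the two signals, here is a minimal Python sketch for a POSIX system. It is not part of BLCR; the child script and the function name term_then_kill are made up for this example. The child ignores SIGTERM, so only signal 9 (SIGKILL) ends it.

```python
import signal
import subprocess
import sys

# Hypothetical child program: ignores SIGTERM, then idles.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(30)\n"
)


def term_then_kill():
    """Send SIGTERM, then SIGKILL, to a child that ignores SIGTERM.

    Returns (survived_sigterm, final_returncode).
    """
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE)
    # Wait until the child reports that SIG_IGN is installed.
    proc.stdout.readline()

    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=1.0)
        survived_sigterm = False
    except subprocess.TimeoutExpired:
        # Still running: the SIGTERM was ignored.
        survived_sigterm = True

    # SIGKILL cannot be caught or ignored, so no cleanup code runs.
    proc.send_signal(signal.SIGKILL)
    proc.wait()
    proc.stdout.close()
    return survived_sigterm, proc.returncode
```

On a POSIX system a process killed by a signal is reported by subprocess with a negative return code equal to the signal number.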
    
    > but when restarting,
    > also sometimes happens that I get the following
    > messages in /var/log/messages 
    > 
    >  kernel: vmadump: mmap failed:
    > /home/adolfo/progs/sd/bd/f90/intelf90/sdbd_nf2g.x (deleted)
    > 
    >  kernel: thaw_threads returned error, aborting. -1
    > 
    
    What this tells me is that the application has created the file named 
    above and mmaped it.  However, at some point *before* the checkpoint was 
    taken the file was deleted (so "--signal 9" won't help).  This is a 
    perfectly legal thing to do, and the kernel will remove the directory 
    entry immediately and will delay removing the file contents until the 
    file is no longer mmaped.  Unfortunately, that means that by the time we 
    go to restore, the file is gone.
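
The mapped-then-deleted behaviour is easy to reproduce outside of BLCR. The following is a minimal Python sketch, assuming a POSIX system; the function name and the payload are invented for the illustration. It maps a temporary file, unlinks it, and shows that the mapping still works even though the path no longer exists.

```python
import mmap
import os
import tempfile


def mmap_then_unlink(payload=b"checkpoint me"):
    """Map a file, delete it, and report what is still reachable.

    Returns (mapping_still_readable, path_still_exists).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        mapping = mmap.mmap(fd, len(payload), access=mmap.ACCESS_READ)
    finally:
        # The mapping stays valid after the descriptor is closed.
        os.close(fd)

    # Remove the directory entry; the contents live on while mapped.
    os.unlink(path)

    still_readable = mapping[:] == payload
    path_exists = os.path.exists(path)

    # Once the last mapping goes away the kernel can free the contents.
    mapping.close()
    return still_readable, path_exists
```

A process in this state can keep running normally, but nothing on disk can be reopened by name afterwards, which is the situation a restart runs into.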
    
    There is very little I can do about this immediately, except to move the 
    error to checkpoint time to avoid "false hopes" of restarting.
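
Until such a check exists, a user can look for the problem before checkpointing. The sketch below is a user-space workaround idea, not how BLCR does or will do it. It relies on the fact that Linux appends " (deleted)" to the pathname column of /proc/<pid>/maps for mappings whose file has been unlinked; the function name is hypothetical.

```python
def deleted_mappings(maps_text):
    """Return paths of deleted files found in /proc/<pid>/maps text.

    Each line has the form:
        address perms offset dev inode pathname
    and anonymous mappings have no pathname field.
    """
    suffix = " (deleted)"
    found = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].rstrip()
        if path.endswith(suffix):
            name = path[:-len(suffix)]
            if name not in found:
                found.append(name)
    return found
```

For a running process one would pass in the contents of /proc/<pid>/maps, for example open("/proc/%d/maps" % pid).read(). A non-empty result suggests that a restart from a checkpoint taken at that moment would hit the mmap failure shown above.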
    
    In the longer term we do plan to explicitly deal with deleted files.
    
    > 
    > and the cr_restart stop with "killed".
    > After this, if I try again to restart it would give
    > "cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy",
    > but the PID is free. And in /var/log/messages appears
    > 
    > kernel: cr_rstrt_request_restart [14041]:  PID conflict found by
    > cr_reserve_ids()
    > 
    
    This is a sign of a BLCR bug.  We probably allocated the pid for the
    process and then failed to deallocate it when the restart failed due to
    the mmap problem.  Unfortunately, that particular pid is lost until the
    next reboot, though this should not have any bad effect other than
    preventing a restart from this particular checkpoint.  I have seen
    similar "lost pids" when I've had more serious restart failures (such as
    a kernel Oops).  In those cases I was not able to track the problem down,
    but your bug report gives me a way to reproduce it so I can figure out
    where we lose track of the pid.
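
The suspected failure mode can be shown with a toy model. This is only a Python sketch of the general "reserve, fail, forget to release" pattern; IdTable and restart are hypothetical names and do not reflect BLCR's actual data structures or code paths.

```python
class IdTable:
    """Toy stand-in for a table of reserved pids (not BLCR's)."""

    def __init__(self):
        self.reserved = set()

    def reserve(self, pid):
        if pid in self.reserved:
            raise RuntimeError("PID conflict for %d" % pid)
        self.reserved.add(pid)

    def release(self, pid):
        self.reserved.discard(pid)


def restart(table, pid, restore, release_on_failure):
    """Reserve pid, then run restore(); optionally undo on failure."""
    table.reserve(pid)
    try:
        restore()
    except Exception:
        if release_on_failure:
            # The step that appears to be missing in the buggy path.
            table.release(pid)
        raise
```

With release_on_failure set to False, a failed restore leaves the pid reserved, and every later attempt fails at reserve() before the restore is even tried, which matches the "PID conflict" symptom described above.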
    
    
    > This happens no on every checkpoint/restart, but very frequently.
    
    "very frequently" probably means that the few times it did work you got 
    lucky that no mmaped-but-deleted files existed at the instant you 
    checkpointed.
    
    
    > thanks in advance for any hint or help.
    
    Thank you for the bug report.  I am sorry that I don't currently have 
    any way to help you to checkpoint/restart your application.
    
    -Paul
    
    > adolfo
    > 
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
