Re: Error from re-start on very large context file

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jun 09 2008 - 00:49:33 PDT

  • Next message: Parviz Fariborz: "Re: Error from re-start on very large context file"
    Parviz,
    
       The problem you describe does not sound like any known bug or limitation in 
    BLCR.  It is likely that you have uncovered a new BLCR bug.
       The "Bad Address" (from the -14) is EFAULT, which suggests that some aspect 
    of the restarted memory mapping is incorrect.  If there is any failure to 
    allocate/map memory, then the vmadump portion of BLCR should be detecting the 
    failure prior to causing an EFAULT by accessing the memory.
       It would help if you could reconfigure and build BLCR with "--enable-debug" 
    passed to BLCR's configure script to enable detailed tracing.  If you load the 
    modules by running "make insmod cr_ktrace_mask=0xffffffff" in your BLCR build 
    directory, then the next time you try to restart from your 36G context file 
    dmesg should provide some detail as to what was happening prior to the EFAULT. 
      Sending us the last 100 lines or so from dmesg should probably be sufficient 
    for us to narrow the possible causes, and perhaps suggest a solution.
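
The steps above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name blcr_debug_restart and the default output file name are hypothetical, it assumes you are in the top of the BLCR build directory with enough privilege to load kernel modules and read the kernel log, and it assumes any previously loaded BLCR modules have already been unloaded. The configure flag, the "make insmod" invocation and the 100-line tail are taken from the message itself.

```shell
#!/bin/sh
# Sketch only: nothing runs until the function is called.
# Assumption: invoked from the top of the BLCR build directory.

blcr_debug_restart() {
    if [ "$#" -lt 1 ]; then
        echo "usage: blcr_debug_restart CONTEXT_FILE [OUTPUT_FILE]" >&2
        return 2
    fi
    context_file=$1
    output_file=${2:-blcr-dmesg-tail.txt}   # hypothetical default name

    # Rebuild BLCR with detailed tracing compiled in.
    ./configure --enable-debug || return 1
    make || return 1

    # Load the modules with every kernel trace category enabled.
    make insmod cr_ktrace_mask=0xffffffff || return 1

    # Reproduce the failure. The restart is expected to fail here,
    # so its exit status is deliberately ignored.
    cr_restart "$context_file" || true

    # Keep the last 100 lines of the kernel log to send to the list.
    dmesg | tail -n 100 > "$output_file"
}
```

For example, "blcr_debug_restart context.21849" would leave the trace in blcr-dmesg-tail.txt in the current directory.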
    
    -Paul
    
    Parviz Fariborz wrote:
    > 
    > Hi,
    > 
    > I get the following error when I re-start a context file produced by 
    > blcr-0.7.0 :
    > 
    > => cr_restart context.21849
    > Restart failed: Bad address
    > 
    > The dmesg command produces the following error :
    > 
    > blcr: Retry request on -CR_ENOSUPPORT
    > blcr: thaw_threads returned error, aborting. -14
    > 
    > Any idea what I may be doing wrong? Is this a bug?
    > 
    > Several more pieces of info :
    > 
    > I run blcr on a 64 bit machine running linux red-hat :
    > 
    > =>uname -a
    > Linux ivel6 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 
    > x86_64 x86_64 GNU/Linux
    > 
    > The size of the context file that produces the error is very large, 
    > around 36G. When I checkpoint the same executable with a smaller data 
    > set, which produces a smaller context file (around 3G), re-start works 
    > with no problem.
    > 
    > Thanks in advance for your help.
    > -Parviz
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
