Re: Error from re-start on very large context file

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Mon Jun 09 2008 - 00:49:33 PDT

Next message: Parviz Fariborz: "Re: Error from re-start on very large context file"

Previous message: Parviz Fariborz: "Error from re-start on very large context file"
In reply to: Parviz Fariborz: "Error from re-start on very large context file"
Next in thread: Parviz Fariborz: "Re: Error from re-start on very large context file"
Reply: Parviz Fariborz: "Re: Error from re-start on very large context file"

Parviz,

   The problem you describe does not sound like any known bug or limitation in 
BLCR.  It is likely that you have uncovered a new BLCR bug.
   The "Bad address" error (the -14 in dmesg) is EFAULT, which suggests that 
some aspect of the restarted memory mapping is incorrect.  If there were a 
failure to allocate/map memory, the vmadump portion of BLCR should detect that 
failure before any access to the memory could cause an EFAULT.
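
For readers unfamiliar with the convention: kernel code reports errors as
negated errno values, so the -14 in the dmesg output corresponds to errno 14.
A minimal sketch for decoding such a value is below; the helper name
describe_kernel_error is hypothetical and is not part of BLCR.

```python
import errno
import os


def describe_kernel_error(code):
    """Map a (possibly negative) kernel return value to its errno name and text.

    Hypothetical helper for illustration only; it is not part of BLCR.
    """
    number = abs(code)
    # errno.errorcode maps numeric errno values to their symbolic names.
    name = errno.errorcode.get(number, "UNKNOWN")
    # os.strerror returns the platform's message for that errno value.
    return "{}: {}".format(name, os.strerror(number))


if __name__ == "__main__":
    # The value reported by "thaw_threads returned error, aborting. -14".
    print(describe_kernel_error(-14))
```

On Linux the symbolic name for 14 is EFAULT, which matches the "Bad address"
message printed by cr_restart.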
   It would help if you could reconfigure and build BLCR with "--enable-debug" 
passed to BLCR's configure script to enable detailed tracing.  If you load the 
modules by running "make insmod cr_ktrace_mask=0xffffffff" in your BLCR build 
directory, then the next time you try to restart from your 36G context file 
dmesg should provide some detail as to what was happening prior to the EFAULT. 
  Sending us the last 100 lines or so from dmesg should probably be sufficient 
for us to narrow the possible causes, and perhaps suggest a solution.
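
One possible way to collect the requested log excerpt is sketched below. It
assumes that the dmesg command is on PATH and that the current user is allowed
to read the kernel log; the function names last_lines and kernel_log_tail are
hypothetical. Running "dmesg | tail -n 100" in a shell gives the same result.

```python
import subprocess


def last_lines(text, count=100):
    """Return the final `count` lines of `text` (all of them if fewer exist)."""
    if count <= 0:
        return []
    return text.splitlines()[-count:]


def kernel_log_tail(count=100):
    """Return the last `count` lines of the kernel log as a list of strings.

    Assumption: `dmesg` is available and readable by the current user.
    Raises subprocess.CalledProcessError if dmesg exits with an error.
    """
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return last_lines(result.stdout, count)
```

Calling kernel_log_tail() right after the failed cr_restart and joining the
result with newlines yields text that can be pasted into a reply.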

-Paul

Parviz Fariborz wrote:
> 
> Hi,
> 
> I get the following error when I re-start a context file produced by 
> blcr-0.7.0 :
> 
> => cr_restart context.21849
> Restart failed: Bad address
> 
> The dmesg command produces the following error :
> 
> blcr: Retry request on -CR_ENOSUPPORT
> blcr: thaw_threads returned error, aborting. -14
> 
> Any idea what I may be doing wrong? Is this a bug?
> 
> Several more pieces of info :
> 
> I run BLCR on a 64-bit machine running Red Hat Linux:
> 
> =>uname -a
> Linux ivel6 2.6.9-67.ELsmp #1 SMP Wed Nov 7 13:56:44 EST 2007 x86_64 
> x86_64 x86_64 GNU/Linux
> 
> The size of the context file that produces the error is very large, 
> around 36G. When I checkpoint the same executable with a smaller data 
> set, which produces a smaller context file (around 3G), re-start works 
> with no problem.
> 
> Thanks in advance for your help.
> -Parviz
> 

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
