Re: "Permission denied" error

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Apr 01 2008 - 12:07:43 PST

  • Next message: me043055_at_mnnit.ac.in: "process checkpinting"
    Yuan,
    
       Sorry for the delay in getting back to you.  I had to ask a colleague 
    to install R for me and then I left on travel about the time that was 
    finished.
       I tried today with BLCR 0.6.5 and was able to checkpoint and restart 
    the script you provided.  I verified that 
    /usr/lib64/gconv/gconv-modules.cache was mmapped (it was not when I had 
    LANG=C in my environment, but changing it to LANG=en_US.UTF-8 caused it to 
    be mmapped).
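
The mmap check described above can be scripted. What follows is a minimal
sketch, not the exact procedure used in this thread: mapped_perms is a
hypothetical helper name, and it assumes the usual six-column layout of
/proc/<pid>/maps (permissions in column 2, pathname in column 6).

```shell
# Read a /proc/<pid>/maps listing on stdin and print the permission field
# (for example "r--s") of every mapping whose pathname equals FILE.
# Prints nothing if FILE is not mapped.
mapped_perms() {
    # Assumption: whitespace-separated columns are
    # address-range, perms, offset, device, inode, pathname.
    awk -v f="$1" '$6 == f { print $2 }'
}

# Possible usage, comparing two locale settings (left commented out so the
# sketch does not depend on tcsh being installed):
#   LANG=C           tcsh -c 'cat /proc/$$/maps' | mapped_perms /usr/lib64/gconv/gconv-modules.cache
#   LANG=en_US.UTF-8 tcsh -c 'cat /proc/$$/maps' | mapped_perms /usr/lib64/gconv/gconv-modules.cache
```

An empty result means the file is not mapped in that process; "r--s" matches
the shared read-only mapping quoted further down in this thread.
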
       Since I cannot reproduce your problem, I am not sure what I can do at 
    this point to help you.  If you have any ideas about what makes your 
    system different, please let me know.
    
       While not related to a "permission denied" error, it is worth noting 
    that your test script looks at wallclock time, which BLCR does not 
    "virtualize".  So if I restart more than 180 seconds after the original 
    program began, then I get only a single ">" line as output.  Not exactly 
    a problem, but I was confused by it initially.
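
One way to make such a test independent of the wallclock is to track progress
in a counter instead of comparing against a start timestamp. The following is
a minimal sketch in shell rather than R; run_ticks and its arguments are
hypothetical names, and how a sleep that is in progress at checkpoint time
behaves after a restart is not something this sketch establishes.

```shell
# Count from 0 to TOTAL ticks, printing the count every EVERY ticks.
# Progress lives in the counter n, not in the wall clock, so time that
# passes between checkpoint and restart is not counted as progress.
# TICK is the sleep length in seconds per tick (default 1).
run_ticks() {
    total=$1
    every=$2
    tick=${3:-1}
    n=0
    while [ "$n" -le "$total" ]; do
        if [ $((n % every)) -eq 0 ]; then
            echo "$n"
        fi
        sleep "$tick"
        n=$((n + 1))
    done
}
```

With these assumptions, `run_ticks 180 10` would print one line every 10
ticks up to 180, mirroring the structure of the quoted R loop.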
    
    -Paul
    
    Yuan Wan wrote:
    > 
    > 
    > Paul,
    > -------------------------------------------------------------------------------------- 
    > 
    > $ ls -l /usr/lib64/gconv/gconv-modules.cache
    > -rw-r--r--  1 root root 21546 Oct  2 14:51 /usr/lib64/gconv/gconv-modules.cache
    > $ tcsh -c 'cat /proc/$$/maps' | grep gconv
    > 2a9892f000-2a98935000 r--s 00000000 08:01 522135   /usr/lib64/gconv/gconv-modules.cache
    > --------------------------------------------------------------------------------------- 
    > 
    > 
    > I cannot see any difference in the permissions.
    > 
    > Can you restart my test script from checkpoint on your machine?
    > 
    > -------------------------------------------
    > #!/bin/sh
    > 
    > PATHTOR=/usr/bin
    > # Below, the phrase "EOF" marks the beginning and end of the HERE document.
    > $PATHTOR/R --no-save  <<EOF
    > mod<-function (x, y)
    > {
    >     x1 <- trunc(trunc(x/y) * y)
    >     z <- trunc(x) - x1
    >     z
    > }
    > 
    > z0 <- unclass(Sys.time())
    > 
    > repeat{
    > 
    > z1<-unclass(Sys.time())
    > secs<-floor(z1-z0)
    > if (mod(secs, 10)==0) print(secs)
    > if ((secs)>180) break
    > 
    > }
    > EOF
    > 
    > -------------------------------------------
    > 
    > 
    > 
    > --Yuan
    > 
    > 
    > 
    > On Fri, 14 Mar 2008, Paul H. Hargrove wrote:
    > 
    >> Yuan,
    >>
    >> What do you get if you run the following two commands?
    >> $ ls -l /usr/lib64/gconv/gconv-modules.cache
    >> $  tcsh -c 'cat /proc/$$/maps' | grep gconv
    >>
    >> What I see is a world readable file and a shared read-only mmap in tcsh:
    >> $ ls -l /usr/lib64/gconv/gconv-modules.cache
    >> -rw-r--r--  1 root root 21514 Jun  3  2005 /usr/lib64/gconv/gconv-modules.cache
    >> $ tcsh -c 'cat /proc/$$/maps' | grep gconv
    >> 2b8e36967000-2b8e3696d000 r--s 00000000 00:0f 9486631 /usr/lib64/gconv/gconv-modules.cache
    >>
    >> So, there shouldn't be a problem unless there is something different 
    >> about your system.
    >>
    >> -Paul
    >>
    >> Paul H. Hargrove wrote:
    >>> Yuan,
    >>>
    >>>  I've not seen that particular failure before, but some quick research
    >>> indicates that gconv-modules.cache is a part of glibc and I suspect that
    >>> it is getting mapped in much the same way as the NSCD file is.  I will
    >>> continue to look into the problem to see what BLCR might be able to do
    >>> differently.
    >>>
    >>> -Paul
    >>>
    >>> Yuan Wan wrote:
    >>>
    >>>> Hi Paul,
    >>>>
    >>>> Thanks for replying.
    >>>> The error message I got from /var/log/messages is the following:
    >>>>
    >>>> vmadump: mmap failed: /usr/lib64/gconv/gconv-modules.cache
    >>>> thaw_threads returned error, aborting. -13
    >>>>
    >>>> The failure does not seem to be caused by NSCD. What do you think?
    >>>>
    >>>> --Yuan
    >>>>
    >>>>
    >>>> On Mon, 10 Mar 2008, Paul H. Hargrove wrote:
    >>>>
    >>>>
    >>>>> Yuan,
    >>>>>
    >>>>>  The most likely cause is that the restart failed to open one of the
    >>>>> files that were open()ed or mmap()ed at the time the checkpoint was
    >>>>> taken.  Based on the fact that you see this w/ a shell script, but not
    >>>>> C code, my best guess is that you are encountering a problem with the
    >>>>> file that the Name Service Cache Daemon (NSCD) uses.  Please see the
    >>>>> following FAQ entry for more detail (including what to look for in the
    >>>>> system logs):
    >>>>>  http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#nscd
    >>>>> The only known work-around is to remove NSCD from your system.
    >>>>>
    >>>>> -Paul
    >>>>>
    >>>>> Yuan Wan wrote:
    >>>>>
    >>>>>> Hi all,
    >>>>>>
    >>>>>> I'm trying to restart my shell script jobs (bash and R) with BLCR, but
    >>>>>> the restart fails with the following error:
    >>>>>>
    >>>>>> "Restart failed: Permission denied"
    >>>>>>
    >>>>>> I can checkpoint the job and get a context file. The restart succeeds
    >>>>>> if executed by root but fails if run by a normal user. The context
    >>>>>> file does belong to me, so I'm wondering where the permission is
    >>>>>> required. I can also restart a C program as a regular user without
    >>>>>> problems.
    >>>>>>
    >>>>>> Does anyone know a possible reason? Thanks
    >>>>>>
    >>>>>> --Yuan
    >>>>>>
    >>>>>> Yuan Wan
    >>>>>>
    >>>>>
    >>>>>
    >>>
    >>>
    >>>
    >>
    >>
    >>
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
