Re: berkeley checkpointing

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Jan 02 2008 - 13:13:50 PST

  • Next message: Jerry Mersel: "Re: berkeley checkpointing"
    Jerry,
    
      Let's try to deal with one problem at a time.  First I'd like to 
    address the "relocation error" and see if resolving it still leaves the 
    second error.
      The purpose of cr_run is to set LD_PRELOAD just as you have done 
    manually.  If you could, please tell me if the following two commands 
    (executed via SGE) each produce the same relocation error:
    
    ${BLCR_HOME}/bin/cr_run matlab -nojvm -nodisplay -nosplash < $H/test.m
    env LD_PRELOAD=libcr.so.0:libpthread.so.0 matlab -nojvm -nodisplay -nosplash < $H/test.m
    
    If you could, also send the output of "env LD_PRELOAD=libpthread.so.0 
    ldd /bin/cat" executed both from the command line and via SGE.
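
To make the two ldd outputs easy to compare, something along these lines may help. It is only a sketch: `capture_ldd` and the output file names are hypothetical, not part of BLCR or SGE, and recording LD_LIBRARY_PATH assumes (without evidence yet) that the two environments might differ there.

```shell
#!/bin/sh
# Sketch only: capture_ldd is a hypothetical helper, not part of BLCR or SGE.
# It records what the dynamic linker resolves for /bin/cat when libpthread
# is preloaded, so the command-line and SGE results can be compared.

capture_ldd() {
    out="$1"
    {
        echo "host: $(uname -n)"
        # Assumption: a differing library path is one possible cause worth recording.
        echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-<unset>}"
        env LD_PRELOAD=libpthread.so.0 ldd /bin/cat
    } > "$out" 2>&1
}

# Example: run once from a login shell and once inside an SGE job script,
# giving each run its own output file, then diff the two files.
# capture_ldd "$HOME/ldd-cmdline.txt"
# capture_ldd "$HOME/ldd-sge.txt"
```

Diffing the two files afterwards should show whether libpthread.so.0 and libc.so.6 resolve to different paths (for example /lib64/tls versus /lib64) in the two environments.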
    
    -Paul
    
    Jerry Mersel wrote:
    > I can checkpoint matlab processes from the command line.
    > But when I use SGE I get this error:
    > /lib64/libc.so.6: relocation error: /lib64/tls/libpthread.so.0: symbol 
    > errno, version GLIBC_PRIVATE not defined in file libc.so.6 with link 
    > time reference
    > Restart failed: No such device or address
    >
    > I get the relocation error at startup, when using cr_run.
    > I get the "Restart failed" error when trying to restart.
    >
    > I start matlab thus:
    > ${BLCR_HOME}/bin/cr_run env LD_PRELOAD=libcr.so.0:libpthread.so.0 
    > matlab -nojvm -nodisplay -nosplash < $H/test.m
    >
    > and try to restart thus:
    > ${BLCR_HOME}/bin/cr_restart $ckptfile
    >
    > my log file says this:
    > Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    > Jan  2 14:24:36 kam02 kernel: Skipping a socket.
    > Jan  2 14:26:03 kam02 kernel: Failed to open chrdev major=5 minor=0 
    > path='/dev/tty')
    > Jan  2 14:26:03 kam02 kernel: cr_restore_all_files [28703]:  Unable to 
    > restore fd 3 (type=6,err=-6)
    > Jan  2 14:26:03 kam02 kernel: cr_rstrt_child [28703]:  Unable to 
    > restore files!  (err=-6)
    >
    > Perhaps something to do with the socket.
    > What do you think?
    >
    >                                Regards,
    >                                   Jerry
    >
    > P.S. I have prelinking turned off.
    >
    >
    > Paul H. Hargrove wrote:
    >
    >> Jerry Mersel wrote:
    >>
    >>> Hi:
    >>>
    >>>  I am trying to migrate jobs on a grid after checkpointing.
    >>> Must the "prelinking" fix mentioned in the FAQ be applied on both
    >>> the checkpointed node and the migrated-to node?
    >>>
    >>>                                     Regards,
    >>>                                        Jerry
    >>
    >> Yes, the prelinking of libraries should be disabled on both the 
    >> "checkpointed on" and "migrated to" nodes.
    >> I will clarify this in the next FAQ version.
    >>
    >> -Paul
    >>
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
