Re: problem migrating jobs

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun May 10 2009 - 10:23:09 PDT

  • Next message: Paul H. Hargrove: "Re: Question about BLCR syscall"
    Sergio,
    
       It sounds as though the shared libraries on host A and host B are not 
    identical.  One possibility is that the two hosts have different versions 
    of the libraries installed.  A second possibility is that they have the 
    same versions installed, but "prelinking" is mapping them to different 
    addresses on the two hosts.
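
One way to test both possibilities at once is to compare checksums of the library files on the two hosts. This works because prelinking rewrites the library files on disk, so differently prelinked copies of the same version should also produce different digests. The following is only a minimal sketch under that assumption: the function names are hypothetical and not part of BLCR or SGE, and the list of library paths has to be supplied by you (for example, from the output of ldd on the job's executable).

```python
import hashlib


def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large libraries are not loaded whole.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(paths):
    """Map each library path to its digest. Run this on each host."""
    return {str(p): file_digest(p) for p in paths}


def compare_manifests(host_a, host_b):
    """Compare two manifests (dicts of path -> digest).

    Returns three sorted lists of paths:
    (only on host A, only on host B, present on both but different).
    """
    only_a = sorted(set(host_a) - set(host_b))
    only_b = sorted(set(host_b) - set(host_a))
    differing = sorted(
        p for p in set(host_a) & set(host_b) if host_a[p] != host_b[p]
    )
    return only_a, only_b, differing
```

If all three lists come back empty for the libraries the job uses, a library mismatch is less likely to be the cause. Any entry in the third list points to either a version difference or a prelinking difference.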
    
       If you think that the libraries installed on the two hosts are the same, 
    then try the instructions in our FAQ for disabling pre-linking: 
    http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
    
       If you know that the library versions are /not/ the same, or if disabling 
    pre-linking does not help, then you will need to add the "--save-private" flag 
    to the cr_checkpoint command in the SGE migration script to request that BLCR 
    include copies of the libraries in the context file.
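
As a rough illustration of that change, the sketch below builds a cr_checkpoint command line that includes "--save-private". Only the flag itself comes from the advice above. The helper name, the use of "-f" to name the context file, and the argument order are assumptions, so check them against the cr_checkpoint documentation for your BLCR version and against what your SGE migration script already does.

```python
def checkpoint_command(pid, context_file, save_private=True):
    """Build an argv list for cr_checkpoint (hypothetical helper).

    Assumes the context file is named with "-f" and the process ID
    comes last; adjust to match your existing migration script.
    """
    argv = ["cr_checkpoint"]
    if save_private:
        # Ask BLCR to include copies of the libraries in the context file.
        argv.append("--save-private")
    argv += ["-f", str(context_file), str(pid)]
    return argv
```

The resulting list could be passed to subprocess.run(..., check=True). In a shell-based migration script, the equivalent change is simply to add the flag to the existing cr_checkpoint line.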
    
       I hope one of the two suggestions above resolves your problem.  If not, let 
    us know and we'll see what else we can try.
    
    -Paul
    
    Sergio Díaz wrote:
    > Hi all,
    > 
    > I am using BLCR + SGE to checkpoint my jobs. It is working fine, and 
    > I can also migrate a job (using qmod -s JOB_ID).
    > The problem is as follows: if I have a job running on host A and I run 
    > qmod -s JOB_ID (to migrate the job), SGE launches the migration script, 
    > takes a checkpoint, kills the job, and puts it back in the queue. When a 
    > host is free, SGE runs the job on that host. If the job runs on host A, it 
    > finishes fine, but if the job is run on another host (host B, for 
    > instance), the job fails.
    > 
    > Running strace on the command cr_restart archivo_checkpoint, I see 
    > the following:
    > 
    > If the job runs on the same host:
    >> .....
    >> close(5)                                = 0
    >> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], __WCLONE|__WALL, 
    >> NULL) = 27782
    >> --- SIGCHLD (Child exited) @ 0 (0) ---
    >> exit_group(0)                           = ?
    >> Process 27972 detached
    > 
    > If the job runs on another host:
    > 
    >> ....
    >> close(5)                                = 0
    >> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
    >> __WCLONE|__WALL, NULL) = 27782
    >> --- SIGCHLD (Child exited) @ 0 (0) ---
    >> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
    >> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
    >> tgkill(8889, 8889, SIGSEGV)             = 0
    >> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    >> +++ killed by SIGSEGV +++
    >> Process 8889 detached
    > 
    > 
    > Any ideas??
    > 
    > Regards,
    > Sergio
    > 
    > 
    > 
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
