Re: problem migrating jobs

From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Fri May 08 2009 - 04:25:47 PDT

    I'm using Gaussian03 (64-bit) for these tests.
    I have done some more tests. If the job is running on host A and I take a 
    checkpoint and then reboot host A, the job can restart on host A without 
    problems once the host is available again. So I guess there are no 
    problems with environment variables or with something allocated in 
    memory...
    
    regards,
    Sergio
    
    
    
    Sergio Díaz wrote:
    > Hi all,
    >
    > I am using BLCR + SGE to checkpoint my jobs. It works fine, 
    > and I can also migrate a job (with qmod -s JOB_ID).
    > The problem is the following: if I have a job running on host A and I run 
    > qmod -s JOB_ID (to migrate the job), SGE launches the migration script, 
    > takes a checkpoint, kills the job, and puts the job in the queue. When a 
    > host is free, SGE runs the job on that host. If the job runs on 
    > host A, it finishes fine, but if the job runs on another host (host 
    > B, for instance) the job fails.
    >
    > Running strace on the command cr_restart archivo_checkpoint, I see 
    > the following:
    >
    > If the job runs on the same host:
    >> .....
    >> close(5)                                = 0
    >> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
    >> __WCLONE|__WALL, NULL) = 27782
    >> --- SIGCHLD (Child exited) @ 0 (0) ---
    >> exit_group(0)                           = ?
    >> Process 27972 detached
    >
    > If the job runs on another host:
    >
    >> ....
    >> close(5)                                = 0
    >> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
    >> __WCLONE|__WALL, NULL) = 27782
    >> --- SIGCHLD (Child exited) @ 0 (0) ---
    >> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
    >> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
    >> tgkill(8889, 8889, SIGSEGV)             = 0
    >> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    >> +++ killed by SIGSEGV +++
    >> Process 8889 detached
    >
    >
    > Any ideas??
    >
    > Regards,
    > Sergio
    >
    >
    >
    >
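The two wait4() results in the traces above differ only in how the restarted child ended: a normal exit with status 0 on the same host, and termination by SIGSEGV on the other host. As a minimal sketch, and not something taken from the SGE or BLCR scripts, this is how a wrapper around the restart command could tell those two outcomes apart. The helper names run_and_wait and describe are hypothetical, and the two child commands are stand-ins for the real cr_restart call.

```python
import os
import signal
import sys


def run_and_wait(argv):
    """Fork, exec argv in the child, and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process with the requested command.
        try:
            os.execvp(argv[0], argv)
        finally:
            # Reached only if exec failed; never fall back into parent code.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return status


def describe(status):
    """Decode a wait status the way strace prints it for wait4()."""
    if os.WIFEXITED(status):
        return "exited with status %d" % os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return "killed by %s" % signal.Signals(os.WTERMSIG(status)).name
    return "unknown status %d" % status


if __name__ == "__main__":
    # Stand-in for a restart that succeeds (hypothetical child command).
    ok = [sys.executable, "-c", "pass"]
    # Stand-in for a restart that dies with SIGSEGV (hypothetical child command).
    bad = [sys.executable, "-c",
           "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
    print(describe(run_and_wait(ok)))
    print(describe(run_and_wait(bad)))
```

In a real migration script the argument list would be the actual restart command, and the decoded result could be written to the job log so that a failed restart on a different host is visible without running strace.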
    
    
    -- 
    Sergio Díaz Montes
    Centro de Supercomputacion de Galicia
    Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    email: [email protected] ; http://www.cesga.es/
    ------------------------------------------------ 
    
