Re: problem migrating jobs


From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Sun May 10 2009 - 10:23:09 PDT

In reply to: Sergio Díaz: "problem migrating jobs"
Next in thread: Sergio Díaz: "Re: problem migrating jobs"
Reply: Sergio Díaz: "Re: problem migrating jobs"

Sergio,

   Your problem sounds like the shared libraries on host A and host B are not 
identical.  One possibility is that the two hosts have different versions of 
the libraries installed.  A second possibility is that they have the same 
versions installed, but "prelinking" is mapping them to different addresses 
on the two hosts.
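
   A rough way to check for either case is to compare checksums of the 
libraries on the two hosts.  Prelinking rewrites the library files in place, 
so a differing checksum can mean either a different version or different 
prelinking.  The sketch below assumes you have saved the output of 
"sha256sum" (run over the libraries that "ldd" lists for your application) 
from each host; the function names are made up for illustration and are not 
part of BLCR or SGE.

```python
def parse_checksums(text):
    """Parse sha256sum-style output ("<digest>  <path>") into {path: digest}."""
    result = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, path = parts
            # sha256sum marks binary-mode entries with a leading '*'
            result[path.strip().lstrip("*")] = digest
    return result


def diff_libraries(host_a_text, host_b_text):
    """Report libraries that are missing or differ between two hosts."""
    a = parse_checksums(host_a_text)
    b = parse_checksums(host_b_text)
    report = {}
    for path in sorted(set(a) | set(b)):
        if path not in a:
            report[path] = "missing on host A"
        elif path not in b:
            report[path] = "missing on host B"
        elif a[path] != b[path]:
            report[path] = "differs"
    return report
```

   An empty result means the listed files are byte-identical on both hosts; 
any entry reported as "differs" is a candidate for the restart failure.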

   If you think that the libraries installed on the two hosts are the same, 
then try the instructions in our FAQ for disabling pre-linking: 
http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .

   If you know that the library versions are /not/ the same, or if disabling 
pre-linking does not help, then you will need to add the "--save-private" flag 
to the cr_checkpoint command in the SGE migration script to request that BLCR 
include copies of the libraries in the context file.
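
   As an illustration only, here is how the checkpoint command line might be 
assembled if the migration step were driven from Python.  The 
"--save-private" flag is the one described above; the "-f" option for the 
context file, the trailing process ID, and the helper name are assumptions 
that should be checked against "cr_checkpoint --help" and your actual SGE 
migration script.

```python
def checkpoint_command(pid, context_file, save_private=True):
    """Build an argument list for cr_checkpoint (hypothetical helper)."""
    cmd = ["cr_checkpoint"]
    if save_private:
        # Ask BLCR to store copies of the libraries in the context file
        cmd.append("--save-private")
    # Assumed layout: -f <context file>, then the ID of the process to checkpoint
    cmd += ["-f", str(context_file), str(pid)]
    return cmd
```

   The resulting list can be passed to subprocess.run(..., check=True).  In a 
shell migration script, the equivalent change is simply adding the flag to 
the existing cr_checkpoint line.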

   I hope one of the two suggestions above resolves your problem.  If not, let 
us know and we'll see what else we can try.

-Paul

Sergio Díaz wrote:
> Hi all,
> 
> I am using BLCR + SGE to checkpoint my jobs. It works fine, and I can 
> also migrate a job (with qmod -s JOB_ID).
> The problem is as follows: if I have a job running on host A and I run 
> qmod -s JOB_ID (to migrate the job), SGE launches the migration script, 
> takes a checkpoint, kills the job, and puts the job in the queue. When a 
> host is free, SGE runs the job on that host. If the job runs on host A, it 
> finishes fine, but if the job runs on another host (host B, for 
> instance), the job fails.
> 
> Running strace on the command cr_restart archivo_checkpoint, I see 
> the following:
> 
> If the job runs on the same host:
>> .....
>> close(5)                                = 0
>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], __WCLONE|__WALL, 
>> NULL) = 27782
>> --- SIGCHLD (Child exited) @ 0 (0) ---
>> exit_group(0)                           = ?
>> Process 27972 detached
> 
> If the job runs on another host:
> 
>> ....
>> close(5)                                = 0
>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
>> __WCLONE|__WALL, NULL) = 27782
>> --- SIGCHLD (Child exited) @ 0 (0) ---
>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
>> tgkill(8889, 8889, SIGSEGV)             = 0
>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> +++ killed by SIGSEGV +++
>> Process 8889 detached
> 
> 
> Any ideas??
> 
> Regards,
> Sergio
> 
> 
> 
> 

-- 
Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
