Re: problem migrating jobs

From: Sergio Díaz (sdiaz_at_cesga.es)
Date: Fri May 15 2009 - 06:35:32 PDT

  • Next message: Paul H. Hargrove: "Re: problem migrating jobs"
    Hi Adolfo,
    
    Thanks for your collaboration. This was very interesting to test.
    I have tested it, and I think my problem is not about stack size:
    I ran some tests submitting jobs with h_stack=8M and
    h_stack=16M, and the jobs still failed. It's true that two
    threads are created on restart, but the same threads are created with the
    option --save-private and without it. With this option the restart fails,
    and without it the restart works.
    Did you do something special to set the stack size to half of the 
    available memory?
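
    For reference, a minimal sketch of the arithmetic behind the workaround
    quoted below. It assumes (following Adolfo's report, not anything
    confirmed here) that cr_restart has two threads and that each thread's
    stack is sized from the stack limit. The names soft_limits and
    stacks_fit are hypothetical; only the calls into Python's standard
    resource module are real.

```python
import resource

UNLIMITED = resource.RLIM_INFINITY


def soft_limits():
    """Return the current soft limits for stack, data segment and virtual memory."""
    names = {
        "stack": resource.RLIMIT_STACK,
        "data": resource.RLIMIT_DATA,
        "vmem": resource.RLIMIT_AS,
    }
    # getrlimit returns (soft, hard); only the soft limit is kept here.
    return {key: resource.getrlimit(res)[0] for key, res in names.items()}


def stacks_fit(stack_limit, vmem_limit, threads=2):
    """True if `threads` stacks of `stack_limit` bytes fit inside `vmem_limit`."""
    if vmem_limit == UNLIMITED:
        return True  # no virtual memory cap, so nothing to exceed
    if stack_limit == UNLIMITED:
        # Assumption taken from the report: an unlimited stack ends up
        # sized like the whole memory limit for each thread.
        stack_limit = vmem_limit
    return threads * stack_limit <= vmem_limit


if __name__ == "__main__":
    limits = soft_limits()
    print(limits)
    print("two stacks fit:", stacks_fit(limits["stack"], limits["vmem"]))
```

    Run inside and outside an SGE job, soft_limits() shows which limits
    differ. With h_stack=8M or 16M and vmem=1G this model says two stacks
    fit, which is consistent with stack size not being the cause here.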
    
    Regards,
    Sergio
    
    
    
    
    
    Adolfo J. Banchio wrote:
    > Hi Sergio,
    >
    > A while ago, when upgrading to BLCR 0.7 (I think that from
    > that version on BLCR itself started using threads), I had
    > some problems restarting within SGE, even after disabling
    > prelinking (in this sense it is different from your case,
    > but it might help anyway).
    > At that time I found that the problem was related to
    > resources set by SGE. Namely, if Stack Size was set to unlimited
    > (the default value in the queue definition), SGE allocated a Stack
    > Size equal to the whole memory for each thread of cr_restart,
    > so the second thread (from cr_restart itself, not related to the
    > actual job, which could be just serial) never got free
    > memory and crashed.
    >
    > If I remember correctly, the only workaround I found was to set
    > Stack Size to half of the available memory, so that after
    > migration there was enough space for both threads (since
    > cr_restart seems to have two threads, and SGE sets the stack
    > for each from that limit). This was the solution at that time,
    > and I still have those limits set.
    >
    > I am not sure that this still applies (in my case disabling
    > prelinking did not help, so, the problem was probably not the
    > same as yours), but I wanted to share this with you, just in case.  
    >
    > regards,
    >
    > adolfo
    >
    >
    >
    >
    > On Thu, 2009-05-14 at 14:19 +0200, Sergio Díaz wrote:
    >   
    >> Hi Paul,
    >>
    >> Regarding the limits: the difference is that SGE sets the two limits
    >> copied below to the value of vmem that you pass to qsub. I used
    >> vmem=1G, so SGE sets the limits to 1G. If you are working on the
    >> host without SGE, these limits are "unlimited". I tested the checkpoint
    >> without SGE, setting the limits to 1048576, and it worked fine. So I
    >> guess the limits are not relevant. About the environment (env): SGE
    >> sets some variables, but I tested without SGE using the same
    >> environment and limits, and it worked fine.
    >>
    >> I'll try to do more tests because I can't understand why it doesn't work
    >> with SGE. Meanwhile, I think disabling prelink could be the best option 
    >> to continue with my work.
    >>
    >>
    >> data seg size           (kbytes, -d) 1048576
    >> virtual memory          (kbytes, -v) 1048576
    >>
    >> Thanks!,
    >> Sergio
    >>
    >>
    >>
    >> Paul H. Hargrove wrote:
    >>     
    >>> Sergio,
    >>>
    >>>   Since --save-private allowed you to migrate the job when not using
    >>> SGE, and disabling prelinking allowed success both with and without 
    >>> SGE I think we can conclude that prelinking was the original cause of 
    >>> your problems.  So, I recommend disabling prelinking as your best option.
    >>>
    >>>   The fact that use of --save-private and/or --save-exe caused errors 
    >>> with SGE is not something I would have guessed in advance.  However, I 
    >>> suspect that it has something to do with resource limits (like the
    >>> limit or ulimit shell built-ins).  This might be something BLCR could 
    >>> work around in the future, but I have no guess at the moment how.  If 
    >>> for some reason you do not wish to disable prelinking, then there may 
    >>> be some resource limit setting in SGE that could be changed to 
    >>> eliminate the "Failed to locate newborn mmap()ed space" problem.  
    >>> However, I am not an SGE expert and so don't know where you would 
    >>> start looking.
    >>>
    >>>   You asked how disabling of prelinking would affect your systems.  
    >>> The answer is that it will cost you a small amount of performance, 
    >>> mostly at program startup.  It will not introduce errors in any 
    >>> programs.  prelinking is also viewed by some as a security 
    >>> improvement.  You can read more about prelinking at 
    >>> http://en.wikipedia.org/wiki/Prelinking
    >>>
    >>> -Paul
    >>>
    >>> Sergio Díaz wrote:
    >>>       
    >>>> Hi Paul and Adolfo,
    >>>>
    >>>> Adolfo, running the job without SGE does not work.
    >>>>
    >>>> Paul, doing the checkpoint with "--save-private" works fine only
    >>>> if I run the jobs without SGE. But if I submit the job to SGE, it
    >>>> doesn't restart correctly, not even on the same host. I get the
    >>>> following error:
    >>>>
    >>>>         
    >>>>> - Failed to locate newborn mmap()ed space
    >>>>> - cr_rstrt_child [20249]:  Unable to load mmap()ed data!  (err=-22)
    >>>>> Restart failed: Invalid argument
    >>>>>           
    >>>> I don't understand why it doesn't work: SGE shouldn't make a
    >>>> difference, because the script that I use for the checkpoint is
    >>>> basically the same.
    >>>> I also tried the option --save-all, but it doesn't work; I got the
    >>>> same error. With the options --save-shared and --save-exe I got the
    >>>> segmentation fault.
    >>>>
    >>>> 2nd attempt: disabling pre-linking and doing the cr_checkpoint with
    >>>> --save-private, I got the same error. But doing the cr_checkpoint
    >>>> without --save-private, it works fine!! I did a successful migration 
    >>>> and the job finished fine.
    >>>>
    >>>> I have to find out how the hosts could be affected if I
    >>>> disable pre-linking. Any ideas? Lower performance? Problems with my
    >>>> applications?
    >>>>
    >>>>
    >>>> Thanks a lot,
    >>>> Sergio
    >>>>
    >>>>
    >>>>
    >>>> Paul H. Hargrove wrote:
    >>>>         
    >>>>> Sergio,
    >>>>>
    >>>>>   Your problem sounds like a problem of not having identical shared 
    >>>>> libraries on host A and host B.  One possibility is that the two 
    >>>>> hosts have different versions of libs installed, and a second 
    >>>>> possibility is that they could have the same versions installed, but 
    >>>>> that "prelinking" may be mapping them to different addresses on the 
    >>>>> two hosts.
    >>>>>
    >>>>>   If you think that the libraries installed on the two hosts are the 
    >>>>> same, then try the instructions in our FAQ for disabling 
    >>>>> pre-linking: http://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink .
    >>>>>
    >>>>>   If you know that the library versions are /not/ the same, or if 
    >>>>> disabling pre-linking does not help, then you will need to add the 
    >>>>> "--save-private" flag to the cr_checkpoint command in the SGE 
    >>>>> migration script to request that BLCR include copies of the 
    >>>>> libraries in the context file.
    >>>>>
    >>>>>   I hope one of the two suggestions above resolves your problem.  If 
    >>>>> not, let us know and we'll see what else we can try.
    >>>>>
    >>>>> -Paul
    >>>>>
    >>>>> Sergio Díaz wrote:
    >>>>>           
    >>>>>> Hi all,
    >>>>>>
    >>>>>> I am using BLCR + SGE to checkpoint my jobs. It's working
    >>>>>> fine, and I can also migrate a job (doing qmod -s JOB_ID).
    >>>>>> The problem is the following: if I have a job running on host A and
    >>>>>> I do a qmod -s JOB_ID (to migrate the job), SGE launches the
    >>>>>> migration script, which takes a checkpoint, kills the job, and puts
    >>>>>> the job back in the queue. When a host is free, SGE runs the job on
    >>>>>> that host. If the job runs on host A, it finishes fine, but if the
    >>>>>> job is run on another host (host B, for instance), the job fails.
    >>>>>>
    >>>>>> Running strace on the command cr_restart archivo_checkpoint, I can
    >>>>>> see the following:
    >>>>>>
    >>>>>> If the job runs on the same host:
    >>>>>>             
    >>>>>>> .....
    >>>>>>> close(5)                                = 0
    >>>>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>>>>>> wait4(27782, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 
    >>>>>>> __WCLONE|__WALL, NULL) = 27782
    >>>>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>>>>>> exit_group(0)                           = ?
    >>>>>>> Process 27972 detached
    >>>>>>>               
    >>>>>> If the job runs on another host:
    >>>>>>
    >>>>>>             
    >>>>>>> ....
    >>>>>>> close(5)                                = 0
    >>>>>>> rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    >>>>>>> wait4(27782, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGSEGV}], 
    >>>>>>> __WCLONE|__WALL, NULL) = 27782
    >>>>>>> --- SIGCHLD (Child exited) @ 0 (0) ---
    >>>>>>> setrlimit(RLIMIT_CORE, {rlim_cur=0, rlim_max=0}) = 0
    >>>>>>> rt_sigaction(SIGSEGV, {SIG_DFL}, NULL, 8) = 0
    >>>>>>> tgkill(8889, 8889, SIGSEGV)             = 0
    >>>>>>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    >>>>>>> +++ killed by SIGSEGV +++
    >>>>>>> Process 8889 detached
    >>>>>>>               
    >>>>>> Any ideas??
    >>>>>>
    >>>>>> Regards,
    >>>>>> Sergio
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>             
    >>>>>           
    >>>>         
    >>>       
    >>     
    
    
    -- 
    Sergio Díaz Montes
    Centro de Supercomputacion de Galicia
    Avda. de Vigo. s/n (Campus Sur) 15706 Santiago de Compostela (Spain)
    Tel: +34 981 56 98 10 ; Fax: +34 981 59 46 16
    email: [email protected] ; http://www.cesga.es/
    ------------------------------------------------ 
    
