Re: LAM: Checkpoint is correct, BUT cannot restart with LAM+BLCR

From: Jerry Mersel (jerry.mersel_at_weizmann.ac.il)
Date: Thu Dec 04 2008 - 02:03:48 PST

  • Next message: Josh Hursey: "Re: OpenMPI and BLCR 0.8.0b2"
    Hi Paul:
    
      I don't mind testing my application with 0.8.0 and/or with
      the patch for 0.7.3, but I was also just using a small test case,
      one where "Hello World 0 of 2" was printed out.
    
      It wasn't a full-blown application.
    
      Do you still want me to try it with my test case?
      I did downgrade to version 0.6.4, and it did work
      with LAM and Grid Engine, but it didn't work if the
      checkpoint files were anywhere but in the home directory.
    
      Interesting.
    
    
                            Best regards,
                               Jerry
    
    > Based on Jerry's logs, I realized that the execve() call in LAM's mpirun
    > at restart time was probably interacting poorly with changes made
    > beginning in BLCR 0.7.0.  I have been able to construct a compact test
    > case that is similar enough to LAM's mpirun behavior to reproduce the
    > symptom: a restarted mpirun-like process is unkillable and does not
    > finish the execve() call.
    >
    > Testing shows that BLCR 0.6.0 works as expected, while 0.7.0 and 0.7.3
    > both hang as described above.
    >
    > The good news is that the exec-from-callback behavior is very similar
    > (from BLCR's point of view) to the SEGV-from-callback reported as bug
    > 2318 ( http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2318 ).  My
    > testing shows that applying the "Proposed fix" attached to that bug
    > report to 0.7.3 resolves the problem for my small test case.
    > Additionally, since this patch is already part of the 0.8.0 betas, the
    > problem Jerry reports is probably NOT present in 0.8.0 (my test case is
    > fine with 0.8.0_b2).
    >
    > Jerry,
    >   I don't have a complete LAM/MPI build to test against.  So, I could
    > really use your help to confirm that the same patch that fixes my small
    > test case works for your full mpirun+application.  If you could please:
    > rebuild the BLCR kernel modules for 0.7.3 with the "proposed fix" for
    > bug 2318 (available at
    > http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=298 ), rmmod+insmod
    > blcr.ko and retry your restart.  The patch does not change the context
    > file format in any way, so it should be safe to restart from your
    > existing checkpoint (assuming it was generated with BLCR 0.7.3).
    >
    > There is still a small BLCR "glitch" with the execve() call: the restart
    > doesn't appear complete until the restarted mpirun exits, whereas the
    > 0.6.0 behavior was to complete "immediately".  I have a plan to resolve
    > this for 0.8.0.
    >
    > -Paul
    >
    > Jerry Mersel wrote:
    >> Hi Paul:
    >>
    >>  I'm running on one machine that is running mpirun and the program
    >> hello.
    >>
    >>  I restart it with cr_restart and mpirun restarts but not the processes.
    >>
    >>  Thank you for your effort and patience.
    >>
    >>                        Regards,
    >>                          Jerry
    >>
    >> P.S. See attachment for log
    >>
    >>
    >>> Jerry,
    >>>
    >>>    Of the three BLCR kernel modules, only <filename> == blcr.ko needs
    >>> the cr_ktrace_mask=0xffffffff argument.  That should be equivalent to
    >>> the make command I suggested.
    >>>
    >>> -Paul
    >>>
    >>> Jerry Mersel wrote:
    >>>
    >>>> Hi Paul:
    >>>>
    >>>>
    >>>>  Would insmod <filename> cr_ktrace_mask=0xffffffff have the same
    >>>> effect?
    >>>>
    >>>>
    >>>>> Jerry,
    >>>>>  Please try loading the BLCR modules with "make insmod
    >>>>> cr_ktrace_mask=0xffffffff" to enable the highest level of debugging
    >>>>> output.  I suspect there will be additional output after the "parent
    >>>>> linkage" message.
    >>>>> -Paul
    >>>>>
    >>>>> Jerry Mersel wrote:
    >>>>>
    >>>>>> Hi:
    >>>>>>
    >>>>>>    I also see the same errors as zhangkan.
    >>>>>>
    >>>>>>    It also stops at "Parent linkage".
    >>>>>>
    >>>>>>    I only manage to start mpirun but not the children,
    >>>>>>    and I need to reboot the machine to get rid of mpirun.
    >>>>>>    I can't kill it. It goes into permanent sleep mode.
    >>>>>>
    >>>>>>
    >>>>>>                             Regards,
    >>>>>>                                Jerry
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>> --
    >>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>>>> Future Technologies Group                 Tel: +1-510-495-2352
    >>>>> HPC Research Department                   Fax: +1-510-486-6900
    >>>>> Lawrence Berkeley National Laboratory
    >>>>>
    >>>>>
    >>>>>
    >>>>
    >>> --
    >>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>> Future Technologies Group                 Tel: +1-510-495-2352
    >>> HPC Research Department                   Fax: +1-510-486-6900
    >>> Lawrence Berkeley National Laboratory
    >>>
    >>
    >
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
    >
    >
    >
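
The module reload discussed in the quoted messages (rmmod, then insmod of blcr.ko, optionally with cr_ktrace_mask=0xffffffff) could be sketched roughly as below. This is only an illustration: the BLCR_KO path, the DRY_RUN switch, and the run/reload_blcr helpers are assumptions made for this sketch and are not part of BLCR. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: reload the blcr module as described in the thread.
# BLCR_KO is a hypothetical path; point it at the blcr.ko you rebuilt.
BLCR_KO=${BLCR_KO:-./blcr.ko}
# DRY_RUN=1 (the default here) prints commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# Helper (assumption of this sketch): echo or execute a command.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Unload blcr, then load the rebuilt module. Extra arguments
# (for example cr_ktrace_mask=0xffffffff) are passed through to insmod.
reload_blcr() {
    run rmmod blcr
    run insmod "$BLCR_KO" "$@"
}

reload_blcr cr_ktrace_mask=0xffffffff
```

Setting DRY_RUN=0 and running as root would execute the commands instead of printing them. Per the thread, only blcr.ko needs the cr_ktrace_mask argument.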
    
