Re: LAM: Checkpoint is correct, BUT cannot restart with LAM+BLCR

From: Jerry Mersel (jerry.mersel_at_weizmann.ac.il)
Date: Sat Dec 06 2008 - 23:04:41 PST

    I'll do it today.
    
            Regards,
             Jerry
    
    > Jerry,
    >
    >    Because of the nature of the problem (in the restart code in mpirun),
    > even a hello_world program should be sufficient to determine if my fix
    > is correct.
    >   I simply don't have the time/patience to configure, build and install
    > LAM just now.  So, I appreciate your help.
    >
    > -Paul
    >
    > Jerry Mersel wrote:
    >> Hi Paul:
    >>
    >>   I don't mind checking my application with 0.8.0 and/or with
    >>   the patch for 0.7.3, but so far I have also only been using a
    >>   small test case, one that prints "Hello World 0 of 2".
    >>
    >>   It wasn't a full-blown application.
    >>
    >>   Do you still want me to try it with my test case?
    >>   I did downgrade to version 0.6.4, and it did work
    >>   with LAM and Grid Engine, but only if the checkpoint
    >>   files were in the home directory.
    >>
    >>   Interesting.
    >>
    >>
    >>                         Best regards,
    >>                            Jerry
    >>
    >>> Based on Jerry's logs, I realized that the execve() call in LAM's
    >>> mpirun at restart time was probably interacting poorly with changes
    >>> made beginning in BLCR 0.7.0.  I have been able to construct a compact
    >>> test case that is similar enough to LAM's mpirun behavior to reproduce
    >>> the symptom: a restarted mpirun-like process is unkillable and does
    >>> not finish the execve() call.
    >>>
    >>> Testing shows that BLCR 0.6.0 works as expected, while 0.7.0 and 0.7.3
    >>> both hang as described above.
    >>>
    >>> The good news is that the exec-from-callback behavior is very similar
    >>> (from BLCR's point of view) to the SEGV-from-callback reported as bug
    >>> 2318 ( http://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2318 ).  My
    >>> testing shows that applying the "Proposed fix" attached to that bug
    >>> report to 0.7.3 resolves the problem for my small test case.
    >>> Additionally, since this patch is already part of the 0.8.0 betas, the
    >>> problem Jerry reports is probably NOT present in 0.8.0 (my test case is
    >>> fine with 0.8.0_b2).
    >>>
    >>> Jerry,
    >>>   I don't have a complete LAM/MPI build to test against.  So, I could
    >>> really use your help to confirm that the same patch that fixes my small
    >>> test case works for your full mpirun+application.  If you could please:
    >>> rebuild the BLCR kernel modules for 0.7.3 with the "proposed fix" for
    >>> bug 2318 (available at
    >>> http://upc-bugs.lbl.gov/bugzilla/attachment.cgi?id=298 ), rmmod+insmod
    >>> blcr.ko and retry your restart.  The patch does not change the context
    >>> file format in any way, so it should be safe to restart from your
    >>> existing checkpoint (assuming it was generated with BLCR 0.7.3).
    >>>
    >>> There is still a small BLCR "glitch" with the execve() call: the
    >>> restart doesn't appear complete until the restarted mpirun exits,
    >>> whereas the 0.6.0 behavior was to complete "immediately".  I have a
    >>> plan to resolve this for 0.8.0.
    >>>
    >>> -Paul
    >>>
    >>> Jerry Mersel wrote:
    >>>
    >>>> Hi Paul:
    >>>>
    >>>>  I'm running on a single machine, which runs both mpirun and the
    >>>>  program hello.
    >>>>
    >>>>  I restart it with cr_restart, and mpirun restarts but the
    >>>>  processes do not.
    >>>>
    >>>>  Thank you for your effort and patience.
    >>>>
    >>>>                        Regards,
    >>>>                          Jerry
    >>>>
    >>>> P.S. See attachment for log
    >>>>
    >>>>
    >>>>> Jerry,
    >>>>>
    >>>>>    Of the three BLCR kernel modules, only <filename> == blcr.ko
    >>>>> needs the cr_ktrace_mask=0xffffffff argument.  That should be
    >>>>> equivalent to the make command I suggested.
    >>>>>
    >>>>> -Paul
    >>>>>
    >>>>> Jerry Mersel wrote:
    >>>>>
    >>>>>
    >>>>>> Hi Paul:
    >>>>>>
    >>>>>>
    >>>>>>  Would insmod <filename> cr_ktrace_mask=0xffffffff have the
    >>>>>>  same effect?
    >>>>>>
    >>>>>>
    >>>>>>> Jerry,
    >>>>>>>  Please try loading the BLCR modules with "make insmod
    >>>>>>> cr_ktrace_mask=0xffffffff" to enable the highest level of debugging
    >>>>>>> output.  I suspect there will be additional output after the
    >>>>>>> "parent linkage" message.
    >>>>>>> -Paul
    >>>>>>>
    >>>>>>> Jerry Mersel wrote:
    >>>>>>>
    >>>>>>>
    >>>>>>>> Hi:
    >>>>>>>>
    >>>>>>>>    I also see the same errors as zhangkan.
    >>>>>>>>
    >>>>>>>>    The restart also stops at "Parent linkage".
    >>>>>>>>
    >>>>>>>>    I only manage to start mpirun, not the children,
    >>>>>>>>    and I need to reboot the machine to get rid of mpirun.
    >>>>>>>>    I can't kill it; it goes into a permanent sleep state.
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>                             Regards,
    >>>>>>>>                                Jerry
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>
    >>>>>>>>
    >>>>>>> --
    >>>>>>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>>>>>> Future Technologies Group                 Tel: +1-510-495-2352
    >>>>>>> HPC Research Department                   Fax: +1-510-486-6900
    >>>>>>> Lawrence Berkeley National Laboratory
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>>>>>>
    >>
    >>
    >>
    >
    >
    >
    >
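
For anyone retrying the restart, the module reload Paul describes (rmmod+insmod of blcr.ko, optionally with cr_ktrace_mask=0xffffffff for full trace output) might be scripted roughly as below. This is only a dry-run sketch under assumptions: the module path ./blcr.ko and the checkpoint file name context.mpirun are placeholders, not names taken from this thread, and the script only prints the commands unless RUN is set to an empty string.

```shell
#!/bin/sh
# Dry-run sketch: each command is printed rather than executed.
# Set RUN="" (and run as root) to execute the commands for real.
RUN=${RUN-echo}
BLCR_KO=${BLCR_KO-./blcr.ko}        # placeholder path to the rebuilt (patched) module
CONTEXT=${CONTEXT-context.mpirun}   # placeholder name of the existing checkpoint file

reload_and_restart() {
    # Unload the old blcr module; the other BLCR modules stay loaded.
    $RUN rmmod blcr
    # Load the rebuilt module with the highest level of kernel tracing.
    $RUN insmod "$BLCR_KO" cr_ktrace_mask=0xffffffff
    # Retry the restart from the existing checkpoint.
    $RUN cr_restart "$CONTEXT"
}

reload_and_restart
```

Only blcr.ko takes the cr_ktrace_mask argument, so the other modules are left alone here. Dropping the argument gives a plain reload without the extra debugging output.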
    
