Re: LAM: Checkpoint is correct, BUT cannot restart with LAM+BLCR

From: Jerry Mersel (jerry.mersel_at_weizmann.ac.il)
Date: Sun Nov 30 2008 - 02:48:01 PST

    Hi Paul:
    
     I'm running everything on a single machine: both mpirun and the
     program hello.
    
     When I restart with cr_restart, mpirun itself restarts, but the
     application processes do not.
    
     Thank you for your effort and patience.
    
                           Regards,
                             Jerry
    
    P.S. See attachment for log
    
    
    
    
    > Jerry,
    >
    >    Of the three BLCR kernel modules, only <filename> == blcr.ko needs the
    > cr_ktrace_mask=0xffffffff argument.  That should be equivalent to the make
    > command I suggested.
    >
    > -Paul
    >
    > Jerry Mersel wrote:
    >> Hi Paul:
    >>
    >>
    >>  Would insmod <filename> cr_ktrace_mask=0xffffffff have the same effect?
    >>
    >>
    >>
    >>
    >>> Jerry,
    >>>  Please try loading the BLCR modules with "make insmod
    >>> cr_ktrace_mask=0xffffffff" to enable the highest level of debugging
    >>> output.  I suspect there will be additional output after the "parent
    >>> linkage" message.
    >>> -Paul
    >>>
    >>> Jerry Mersel wrote:
    >>>> Hi:
    >>>>
    >>>>    I see the same errors as zhangkan.
    >>>>
    >>>>    The restart also stops at the "parent linkage" message.
    >>>>
    >>>>    I can only get mpirun itself to start, not its children,
    >>>>    and I have to reboot the machine to get rid of mpirun.
    >>>>    I can't kill it; it goes into a permanent sleep state.
    >>>>
    >>>>
    >>>>                             Regards,
    >>>>                                Jerry
    >>>>
    >>>>
    >>>
    >>> --
    >>> Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>> Future Technologies Group                 Tel: +1-510-495-2352
    >>> HPC Research Department                   Fax: +1-510-486-6900
    >>> Lawrence Berkeley National Laboratory
    >>>
    >>>
    >>
    >>
    >
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                   Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
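
For reference, the manual module-loading step discussed above could be sketched roughly as follows. This is only an illustration, not the BLCR build system's own procedure: the thread confirms just blcr.ko and the cr_ktrace_mask=0xffffffff argument, so the other two module file names, their load order, and the directory argument are assumptions. The function only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Print (do not execute) insmod commands for loading BLCR with full tracing.
# ASSUMPTION: blcr_imports.ko and blcr_vmadump.ko, and this load order, are
# illustrative; the thread names only blcr.ko and the cr_ktrace_mask value.
blcr_insmod_cmds() {
    moddir=${1:-.}           # hypothetical directory holding the .ko files
    mask=${2:-0xffffffff}    # trace mask suggested in the thread
    for mod in blcr_imports blcr_vmadump blcr; do
        if [ "$mod" = "blcr" ]; then
            # Per the thread, only blcr.ko needs the trace-mask argument.
            echo "insmod $moddir/$mod.ko cr_ktrace_mask=$mask"
        else
            echo "insmod $moddir/$mod.ko"
        fi
    done
}

# With no arguments, print the commands for modules in the current directory.
blcr_insmod_cmds "$@"
```

After reviewing the printed commands and running them as root, the additional debugging output should show up in the kernel log (for example via dmesg) when cr_restart is attempted again.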
    
    
    
    

