Re: Error while using cr_checkpoint on ARM

From: Manish Dwivedi (mdwivedi_at_gmail_dot_com)
Date: Thu Aug 07 2008 - 22:46:27 PDT

  • Next message: Paul H. Hargrove: "Re: Error while using cr_checkpoint on ARM"
    Hi Paul,
    
    We have reloaded the modules with the cr_ktrace_mask variable set and got
    the following logs:
    ==========================================================
    blcr: Berkeley Lab Checkpoint/Restart (BLCR) module version 0.7.2.
    blcr:   Tracing enabled (trace_mask=0xffffffff)
    blcr:   Supports kernel interface version 0.9.0.
    blcr:   Supports context file format version 7.
    blcr: http://ftg.lbl.gov/checkpoint
    cr_proc_init <cr_proc.c:47>, pid 704: entering
    eth0: spurious interrupt (mask = 0xb3)
    ctrl_open <cr_fops.c:231>, pid 712: entering
    ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=4004a130 arg=0x9
    ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=4004a107 arg=0x4
    cr_phase2_register <cr_sync.c:73>, pid 712: entering
    __cr_task_get <cr_task.c:98>, pid 712: Alloc cr_task_t c53e8628 for pid 712
    ctrl_ioctl <cr_fops.c:133>, pid 714: entering op=4004a105 arg=0x4
    cr_phase1_register <cr_async.c:145>, pid 714: entering
    __cr_task_get <cr_task.c:98>, pid 714: Alloc cr_task_t c53e85cc for pid 714
    ctrl_ioctl <cr_fops.c:133>, pid 714: entering op=c004a106 arg=0x0
    cr_suspend <cr_async.c:79>, pid 714: entering
    ctrl_open <cr_fops.c:231>, pid 712: entering
    ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=4004a130 arg=0x9
    ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=4004a110 arg=0xbe9ed9b4
    cr_chkpt_req <cr_chkpt_req.c:856>, pid 712: entering
    cr_log_request <cr_chkpt_req.c:834>, pid 712: checkpointing process tree 709
    cr_chkpt_req <cr_chkpt_req.c:878>, pid 712:  checkpoint params:  secs= 0, opts=00000000, fd=3
    alloc_request <cr_chkpt_req.c:254>, pid 712: Alloc cr_chkpt_req_t c4f6bb38
    ctrl_open <cr_fops.c:231>, pid 712: entering
    cr_loc_init <cr_dest_file.c:158>, pid 712: Calling do_init_reg on fd 3
    build_req_tree <cr_chkpt_req.c:707>, pid 712: in build_req_tree
    add_proc <cr_chkpt_req.c:562>, pid 712: Add proc pid=709
    add_task <cr_chkpt_req.c:436>, pid 712: entering task=c0536040 (709)
    __cr_task_get <cr_task.c:98>, pid 712: Alloc cr_task_t c53e8570 for pid 709
    build_req_tree <cr_chkpt_req.c:722>, pid 712: scanning children
    build_req_tree <cr_chkpt_req.c:728>, pid 712: found child 709
    do_trigger <cr_trigger.c:94>, pid 712: triggered pid 709 (hello) w/ retval=0
    ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=c004a111 arg=0x0
    cr_chkpt_done <cr_chkpt_req.c:1383>, pid 712: entering
    ctrl_ioctl <cr_fops.c:133>, pid 709: entering op=4004a101 arg=0x4000
    cr_dump_self <cr_dump_self.c:999>, pid 709: entering flags=0x4000
    cr_dump_self <cr_dump_self.c:1019>, pid 709: NOTIFY(&req->preshared_barrier)
    cr_dump_self <cr_dump_self.c:1020>, pid 709: TEST(&req->preshared_barrier) returning 1
    cr_save_file_header <cr_dump_self.c:697>, pid 709: Dumping file header
    cr_signal_predump_barrier <cr_chkpt_req.c:1148>, pid 709: NOTIFY(&proc_req->predump_barrier)
    cr_signal_predump_barrier <cr_chkpt_req.c:1149>, pid 709: ONCE(&proc_req->predump_barrier, 1) begin
    cr_signal_predump_barrier <cr_chkpt_req.c:1149>, pid 709: ONCE(&proc_req->predump_barrier, 1) returning 1
    cr_do_vmadump <cr_dump_self.c:777>, pid 709: Preparing to dump 1 threads of hello
    cr_save_header <cr_dump_self.c:725>, pid 709: Dumping header for 1 threads
    cr_do_vmadump <cr_dump_self.c:786>, pid 709: Writing the per-process linkage.
    cr_do_vmadump <cr_dump_self.c:813>, pid 709: Writing credentials
    cr_do_vmadump <cr_dump_self.c:938>, pid 709: NOTIFY(&proc_req->vmadump_barrier)
    cr_dump_self <cr_dump_self.c:1115>, pid 709: ENTER(&req->postdump_barrier) begin
    cr_dump_self <cr_dump_self.c:1115>, pid 709: ENTER(&req->postdump_barrier) returning 1
    cr_dump_self <cr_dump_self.c:1120>, pid 709: Writing the trailer.
    cr_save_header <cr_dump_self.c:725>, pid 709: Dumping header for 0 threads
    cr_signal_chkpt_complete_barrier <cr_chkpt_req.c:1178>, pid 709: NOTIFY(&proc_req->pre_complete_barrier)
    cr_signal_chkpt_complete_barrier <cr_chkpt_req.c:1180>, pid 709: ONCE(&proc_req->pre_complete_barrier, 1) begin
    cr_signal_chkpt_complete_barrier <cr_chkpt_req.c:1180>, pid 709: ONCE(&proc_req->pre_complete_barrier, 1) returning 1
    cr_chkpt_task_complete <cr_chkpt_req.c:1305>, pid 709: NOTIFY(&proc_req->post_complete_barrier)
    cr_chkpt_task_complete <cr_chkpt_req.c:1306>, pid 709: WAIT(&proc_req->post_complete_barrier) begin
    cr_chkpt_task_complete <cr_chkpt_req.c:1306>, pid 709: WAIT(&proc_req->post_complete_barrier) returning 1
    __cr_task_put <cr_task.c:126>, pid 709: Free cr_task_t c53e8570
    cr_dump_self <cr_dump_self.c:1152>, pid 709: leaving Returning -12
    cr_chkpt_done <cr_chkpt_req.c:1424>, pid 712: leaving Returning 1
    ctrl_ioctl <cr_fops.c:133>, pid 712: entering op=0000a112 arg=0xffffffff
    ctrl_release <cr_fops.c:246>, pid 712: entering
    release_request <cr_chkpt_req.c:52>, pid 712: Free cr_chkpt_req_t c4f6bb38
    ctrl_release <cr_fops.c:246>, pid 712: entering
    cr_suspend <cr_async.c:117>, pid 714: leaving with pending signal
    ctrl_release <cr_fops.c:246>, pid 712: entering
    __cr_task_put <cr_task.c:126>, pid 712: Free cr_task_t c53e85cc
    __cr_task_put <cr_task.c:126>, pid 712: Free cr_task_t c53e8628
    
    ================================================================
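
For anyone reading this trace later: the only negative return value in it is the "leaving Returning -12" from cr_dump_self. Assuming BLCR follows the usual Linux kernel convention of returning a negated errno value, 12 is ENOMEM, which matches the "Cannot allocate memory" message printed by cr_checkpoint. The following is a minimal Python sketch for pulling such lines out of a saved trace; the regular expression and the helper name failing_returns are illustrative assumptions based on the line format shown above, not part of BLCR.

```python
import errno
import os
import re

# Matches trace lines of the form seen in the log above, for example:
#   cr_dump_self <cr_dump_self.c:1152>, pid 709: leaving Returning -12
RETURN_RE = re.compile(
    r"(?P<func>\w+) <(?P<file>[\w.]+):(?P<line>\d+)>, pid (?P<pid>\d+): "
    r"leaving Returning (?P<ret>-?\d+)"
)


def failing_returns(log_text):
    """Yield one tuple per 'leaving Returning N' line where N is negative.

    Assumption: a negative N is a negated errno value (kernel convention).
    """
    for m in RETURN_RE.finditer(log_text):
        ret = int(m.group("ret"))
        if ret < 0:
            code = -ret
            yield (
                m.group("func"),
                f'{m.group("file")}:{m.group("line")}',
                int(m.group("pid")),
                errno.errorcode.get(code, "unknown"),  # symbolic name, e.g. ENOMEM
                os.strerror(code),                     # platform's message text
            )


# Two lines copied from the trace above.
sample = """\
cr_dump_self <cr_dump_self.c:1152>, pid 709: leaving Returning -12
cr_chkpt_done <cr_chkpt_req.c:1424>, pid 712: leaving Returning 1
"""

results = list(failing_returns(sample))
for row in results:
    print(row)
```

In practice the text would come from the output of dmesg saved to a file rather than an inline string. Note that the trace shows where the error is returned, not which allocation inside cr_dump_self failed.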
    
    Thanks a lot for your help.
    
    Regards,
    Manish
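
A note on reading the op= values in the ctrl_ioctl lines of the trace above: on architectures that use the generic Linux ioctl encoding (ARM and x86 both do), a command number packs a direction, an argument size, a type byte, and a sequence number. The Python sketch below splits a value into those fields. The helper name decode_ioctl is hypothetical, and the meaning of each sequence number within BLCR is not something this thread states, so the sketch only reports the raw fields.

```python
# Field widths from the generic Linux ioctl encoding (asm-generic/ioctl.h).
NR_BITS, TYPE_BITS, SIZE_BITS, DIR_BITS = 8, 8, 14, 2
TYPE_SHIFT = NR_BITS
SIZE_SHIFT = TYPE_SHIFT + TYPE_BITS
DIR_SHIFT = SIZE_SHIFT + SIZE_BITS

# Direction is from the caller's point of view: "write" means user space
# passes data to the kernel, "read" means the kernel passes data back.
DIR_NAMES = {0: "none", 1: "write", 2: "read", 3: "read|write"}


def decode_ioctl(op):
    """Split a 32-bit ioctl command number into its four fields."""
    return {
        "dir": DIR_NAMES[(op >> DIR_SHIFT) & ((1 << DIR_BITS) - 1)],
        "size": (op >> SIZE_SHIFT) & ((1 << SIZE_BITS) - 1),
        "type": (op >> TYPE_SHIFT) & ((1 << TYPE_BITS) - 1),
        "nr": op & ((1 << NR_BITS) - 1),
    }


# Three op values copied from the trace above.
decoded = {op: decode_ioctl(op) for op in (0x4004A130, 0xC004A106, 0x0000A112)}
for op, fields in decoded.items():
    print(f"{op:#010x}", fields)
```

All three values share the type byte 0xa1, and the ones that carry an argument declare a size of 4 bytes.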
    On Fri, Aug 8, 2008 at 1:20 AM, Paul H. Hargrove
    <PHHargrove_at_lbl_dot_gov> wrote:
    
    > Manish,
    >
    >  There is no stated/known minimum memory requirement for BLCR, but it is
    > still possible that we are too aggressive with memory.  I run an emulated
    > ARM environment in QEMU and have not yet tried running with so little memory
    > (though I plan to try today).
    >  The default level of tracing detail didn't produce much output for your
    > case because the failure appears to come relatively early.  By requesting
    > more detailed tracing, we should be able to narrow down when in BLCR we've
    > failed to allocate memory.
    >  Please reload the kernel modules with "make insmod
    > cr_ktrace_mask=0xffffffff", which will enable the most detailed tracing.
    >  Then rerun your failed checkpoint and, again, send the output.  Hopefully
    > this time there will be enough for me to move forward on diagnosing your
    > problem.
    >
    > Thanks for your patience,
    > Paul
    >
    > Manish Dwivedi wrote:
    >
    >> Hi Paul,
    >>
    >> Thanks for the information. We tried compiling it with the --enable-debug
    >> option today, but we didn't get much information in the log (the log file
    >> is attached to this e-mail).
    >>
    >> By the way, we have 64 MB of RAM in the system. Is there a limit or
    >> minimum RAM requirement for BLCR?
    >>
    >> Regards,
    >> Manish
    >>
    >> PS: We followed exactly the same process on x86 and it is working fine
    >> for us.
    >>
    >>
    >> On Wed, Aug 6, 2008 at 10:58 PM, Paul H. Hargrove
    >> <PHHargrove_at_lbl_dot_gov> wrote:
    >>
    >>    Manish,
    >>
    >>     I am sorry to hear that you are having problems.  From the
    >>    information you provide below, it is hard to say what the problem
    >>    is, other than to guess that your ARM system is low on memory.
    >>     I am aware of a kernel-side memory leak in blcr-0.7.2, which
    >>    should be fixed in the 0.7.3 release expected later this week or
    >>    early next week.  So, I'd like to know if the failure you describe
    >>    happens on the very first use of cr_checkpoint, or does it happen
    >>    after BLCR has been used several times (for instance by running
    >>    "make check")?  If it works for a while and then begins to fail,
    >>    I'd suspect the known memory leak and suggest that you wait for
    >>    blcr-0.7.3.
    >>     If you are seeing failure on the very first attempt to use blcr,
    >>    then I suggest that you rebuild blcr with debugging enabled and
    >>    send me the information dumped to the system logs (run dmesg or
    >>    see /var/log/messages to find the logs).  To do this, you'll need
    >>    to start at the beginning of the configure/make/install process
    >>    and pass the "--enable-debug" option to configure, and then
    >>    proceed with the rest of the build/install process.  Be sure to
    >>    "make insmod" (or manually rmmod the old modules and
    >>    insmod/modprobe the new ones); otherwise the kernel modules from
    >>    your previous (non-debug) build may still be running.  With the
    >>    new kernel modules loaded, you should retry your failing command
    >>    and then look for messages with "blcr: " in them in the system logs.
    >>
    >>     I also should tell you that there is an ARM-specific mailing list
    >>    (very low volume) for BLCR that may help you reach other ARM
    >>    users.  You can find list info and subscribe (required to post) at
    >>    https://hpcrdm.lbl.gov/mailman/listinfo/blcr-arm
    >>
    >>    -Paul
    >>
    >>
    >>    Manish Dwivedi wrote:
    >>
    >>        Hi All,
    >>
    >>        I am trying to use BLCR on ARM, but when I run cr_checkpoint
    >>        on a hello.c program it gives me the following error:
    >>
    >>        cr_checkpoint --term <pid> (command run)
    >>        Checkpoint failed: Cannot allocate memory
    >>
    >>        I have compiled hello.c on the same kernel as mentioned in the
    >>        release notes, and I am using blcr-0.7.2.tar.gz.
    >>
    >>        Could anyone help me resolve this issue so that I can
    >>        test it? It works fine for me on an x86 machine.
    >>
    >>        Regards,
    >>        Manish
    >>
    >>
    >>
    >>    --
    >>    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>    Future Technologies Group
    >>    HPC Research Department                   Tel: +1-510-495-2352
    >>    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >>
    >>
    >>
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group
    > HPC Research Department                   Tel: +1-510-495-2352
    > Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    >
    >
    