Re: Please advise me about restarting with BLCR

From: Eric Roman (ESRoman_at_berkeley_dot_edu)
Date: Thu Oct 25 2007 - 10:47:12 PDT

  • Next message: Hideyuki Jitsumoto: "Re: Please advise me about restarting with BLCR"
    I'm not sure where the invalid argument happened.  There are a lot of
    places where we can return EINVAL.
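
For readers matching the message to the error code: "Invalid argument" is the C library's standard text for EINVAL, so the "Restart failed: Invalid argument" report only says that some check returned EINVAL, not which one. A minimal sketch showing the mapping (plain Python, nothing BLCR-specific is assumed):

```python
import errno
import os

# EINVAL is the errno value behind the "Invalid argument" message text.
name = errno.errorcode[errno.EINVAL]   # symbolic name of the error code
text = os.strerror(errno.EINVAL)       # message text from the C library

print(name, "->", text)
```

Because many code paths can produce the same code, the message text alone cannot locate the failure; the kernel trace is needed for that.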
    
    Let's get the easy stuff out of the way first.  Did "make check" work on
    the Opteron?  Try running that, and see whether the tests pass.  If that's
    ok, my guess is that it's the MPI issue.  I'm really surprised that you
    were able to restart an MPICH code at all.
    
    We don't support BLCR with MPICH right now.  That really shouldn't work at all.
    If you want to checkpoint an MPI job, you can use LAM MPI or a recent release
    of MVAPICH (for Infiniband).  OpenMPI support is coming -- it's in their
    subversion tree, but not yet in a released version.  OpenMPI checkpointing
    will be released in a few weeks.
    
    For now, you'll need to build the LAM libraries with BLCR support, and
    relink your application with those libraries.  There are instructions
    for how to do this on the LAM MPI web page.  Once that's done, let me
    know if you still see the error.  BLCR should work fine on the Opteron
    environment you're using.
    
    Eric
    
    On Thu, Oct 25, 2007 at 08:24:36PM +0900, Hideyuki Jitsumoto wrote:
    > Dear BLCR-ML-Members,
    > 
    > I tried to use BLCR for checkpointing mpich on 2 execution environments.
    > I used exactly the same code for the MPI application, mpich, and BLCR.
    > But on one environment, I got the error message "Restart failed: Invalid argument".
    > 
    > Environment
    > 1. VMware 6.1(Intel Core 2 Duo), Linux-2.6.8-2-686, gcc 3.3.5
    > 2. AMD Opteron 242*2, Linux-2.6.12.2, gcc 3.3.5
    > 
    > On Environment 1 the restart worked correctly, but on Environment 2 it didn't.
    > So I compared the kernel logs with CR_KTRACE_ALL enabled.
    > Then I noticed that Environment 2 has an error in cr_rstrt_child.
    > 
    > Please advise me about what happened in BLCR, if you have any idea.
    > Thank you.
    > 
    > - The contents of /var/log/messages
    > Environment 1 had:
    > ....
    > Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
    > <cr_rstrt_req.c:1880>, pid 32752: : recovering fs_struct...
    > Oct 22 12:14:10 Concertino1 kernel: cr_load_file_info
    > <cr_rstrt_req.c:1339>, pid 32752: : entering
    > Oct 22 12:14:10 Concertino1 kernel: cr_restore_all_files
    > <cr_rstrt_req.c:1911>, pid 32752: :    fd=0 dnr=1
    > Oct 22 12:14:10 Concertino1 kernel: cr_restore_open_fifo
    > <cr_pipes.c:488>, pid 32752: : entering
    > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > <cr_pipes.c:498>, pid 32752: :    Open fifo: id == cef61800.
    > Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
    > pid 32752: : pipe:[57509]:  Phase 1: Making new pipe.
    > Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
    > <cr_rstrt_req.c:1819>, pid 32752: : entering
    > Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
    > <cr_rstrt_req.c:1339>, pid 32752: : entering
    > Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
    > <cr_rstrt_req.c:1911>, pid 32752: :    fd=1 dnr=1
    > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > <cr_pipes.c:488>, pid 32752: : entering
    > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > <cr_pipes.c:498>, pid 32752: :    Open fifo: id == cef616c0.
    > Oct 22 12:14:11 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
    > pid 32752: : pipe:[57510]:  Phase 1: Making new pipe.
    > Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
    > <cr_rstrt_req.c:1819>, pid 32752: : entering
    > Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
    > <cr_rstrt_req.c:1339>, pid 32752: : entering
    > Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
    > <cr_rstrt_req.c:1911>, pid 32752: :    fd=2 dnr=1
    > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > <cr_pipes.c:488>, pid 32752: : entering
    > Oct 22 12:14:11 Concertino1 kernel: cr_restore_open_fifo
    > <cr_pipes.c:498>, pid 32752: :    Open fifo: id == cf2c16c0.
    > Oct 22 12:14:12 Concertino1 kernel: cr_make_new_pipe <cr_pipes.c:437>,
    > pid 32752: : pipe:[57511]:  Phase 1: Making new pipe.
    > Oct 22 12:14:12 Concertino1 kernel: cr_restore_file_locks
    > <cr_rstrt_req.c:1819>, pid 32752: : entering
    > ....
    > 
    > Environment 2 had:
    > ....
    > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
    > <cr_rstrt_req.c:1880>, pid 18556: : recovering fs_struct...
    > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
    > <cr_rstrt_req.c:1339>, pid 18556: : entering
    > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
    > <cr_rstrt_req.c:1911>, pid 18556: :    fd=0 dnr=1
    > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
    > pid 18556: : entering
    > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
    > pid 18556: :    Open fifo: id == f75172c0.
    > Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
    > 18556: : pipe:[595796]:  Phase 1: Making new pipe.
    > Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
    > <cr_rstrt_req.c:1819>, pid 18556: : entering
    > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
    > <cr_rstrt_req.c:1339>, pid 18556: : entering
    > Oct 25 18:43:29 pad047 kernel: cr_restore_all_files
    > <cr_rstrt_req.c:1911>, pid 18556: :    fd=1 dnr=1
    > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:488>,
    > pid 18556: : entering
    > Oct 25 18:43:29 pad047 kernel: cr_restore_open_fifo <cr_pipes.c:498>,
    > pid 18556: :    Open fifo: id == f7517698.
    > Oct 25 18:43:29 pad047 kernel: cr_make_new_pipe <cr_pipes.c:437>, pid
    > 18556: : pipe:[595797]:  Phase 1: Making new pipe.
    > Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
    > <cr_rstrt_req.c:1819>, pid 18556: : entering
    > Oct 25 18:43:29 pad047 kernel: cr_load_file_info
    > <cr_rstrt_req.c:1339>, pid 18556: : entering
    > Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>,
    > pid 18556: : 18556: closing request descriptor
    > Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2435>,
    > pid 18556: : 18556: closing context file descriptor
    > Oct 25 18:43:29 pad047 kernel: release_rstrt_req <cr_rstrt_req.c:94>,
    > pid 18556: : ref count is approximately 2
    > Oct 25 18:43:29 pad047 kernel: __cr_task_put <cr_task.c:114>, pid
    > 18556: : Free cr_task_t ebf6f480
    > ....
    > 
    > -- 
    > Sincerely Yours,
    > Hideyuki Jitsumoto ([email protected])
    > Tokyo Institute of Technology Grad. School of Info. and Eng.
    > Dept. MCS (Matsuoka Lab.)
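
The comparison of the two kernel traces described in the quoted message can be automated. The following is a minimal sketch, not a BLCR tool: it assumes trace records have the shape seen in the logs above ("kernel: function <file:line>, pid ..."), and the helper names trace_points and first_divergence are made up for illustration. Hostnames, timestamps, and pids are ignored, so only the sequence of trace points is compared.

```python
import re

# Function name and <file:line> tag of one trace record, e.g.
# "kernel: cr_load_file_info <cr_rstrt_req.c:1339>, pid 18556: : entering".
TRACE_RE = re.compile(r"kernel:\s+(\w+)\s+<([\w.]+):(\d+)>")

def trace_points(log_text):
    """Return the (function, file, line) tuples found in log_text, in order."""
    # The mail wraps long records and quotes them with "> ", so strip the
    # quote prefix and rejoin everything before matching.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    return [(func, src, int(num)) for func, src, num in TRACE_RE.findall(flat)]

def first_divergence(good, bad):
    """Return (index, good_point, bad_point) at the first mismatch, or None."""
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b:
            return i, g, b
    return None

# Excerpts copied from the two logs quoted in this message.
GOOD_EXCERPT = """
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_file_locks
> <cr_rstrt_req.c:1819>, pid 32752: : entering
> Oct 22 12:14:11 Concertino1 kernel: cr_load_file_info
> <cr_rstrt_req.c:1339>, pid 32752: : entering
> Oct 22 12:14:11 Concertino1 kernel: cr_restore_all_files
> <cr_rstrt_req.c:1911>, pid 32752: :    fd=2 dnr=1
"""
BAD_EXCERPT = """
> Oct 25 18:43:29 pad047 kernel: cr_restore_file_locks
> <cr_rstrt_req.c:1819>, pid 18556: : entering
> Oct 25 18:43:29 pad047 kernel: cr_load_file_info
> <cr_rstrt_req.c:1339>, pid 18556: : entering
> Oct 25 18:43:29 pad047 kernel: cr_rstrt_child <cr_rstrt_req.c:2424>,
> pid 18556: : 18556: closing request descriptor
"""

result = first_divergence(trace_points(GOOD_EXCERPT), trace_points(BAD_EXCERPT))
print(result)
```

On these excerpts the first mismatch is the third trace point: the working restart continues in cr_restore_all_files (line 1911), while the failing one goes straight to cr_rstrt_child (line 2424) after cr_load_file_info. That only shows where the traces part ways; it does not identify which check returned EINVAL.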
    
    -- 
    Eric Roman                       Department of Physics
    510-642-7302                     UC Berkeley
    
