Re: Problems with BLCR?

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Tue Jul 26 2005 - 10:49:05 PDT

    Typically this is an indication that the original pids are (still) in 
    use.  My guess is that the original MPI processes are still running.
    
    -Paul
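
One way to test this guess before retrying cr_restart is to check whether the pid a context file was taken from is still alive. The sketch below is only an illustration: it assumes the context.<pid> file naming seen in the quoted command (context.4037) and a Linux /proc filesystem, and pid_in_use and check_context are hypothetical helper names, not part of BLCR or LAM/MPI.

```shell
#!/bin/sh
# Sketch: report whether the pid encoded in a context file name is in use.
# Assumption: context files are named context.<pid>, as in "context.4037".

# pid_in_use PID: succeed if a process with this pid currently exists.
pid_in_use() {
    # kill -0 delivers no signal; it only tests whether the pid can be
    # signalled. It also fails for processes owned by another user, so
    # fall back to checking /proc (Linux-specific).
    kill -0 "$1" 2>/dev/null || [ -d "/proc/$1" ]
}

# check_context FILE: return 0 if the pid is free, 1 if it is in use,
# 2 if no pid can be parsed from the file name.
check_context() {
    pid=${1##*.}                      # text after the last dot
    case $pid in
        ''|*[!0-9]*)
            echo "cannot parse a pid from '$1'"
            return 2
            ;;
    esac
    if pid_in_use "$pid"; then
        echo "pid $pid is still in use; a restart from $1 may fail as busy"
        return 1
    fi
    echo "pid $pid is free"
    return 0
}
```

Running check_context on each of the N+1 context files on the node where they were taken shows which original pids are still occupied. If any are, the leftover processes would need to exit before the restart is retried.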
    
    Jeff Squyres wrote:
    > A user was having problems with LAM + BLCR, so I got a guest account on 
    > his cluster and gave it a whirl.  With my own build of LAM/MPI, I'm able 
    > to checkpoint just fine (i.e., I get N+1 checkpoint files).  But when I 
    > try to restart, I get the following error:
    > 
    > [jeff@linf1 ~]$ cr_restart context.4037
    > cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    > cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    > cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    > cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    > 
    > What does this mean?
    > 
    > I had checkpointed a simple "hello world" MPI application (4 MPI 
    > processes) on a single node.
    > 
    > The user has already been in contact with Paul -- from his initial post 
    > on the LAM list 
    > (http://www.lam-mpi.org/MailArchives/lam/2005/07/11015.php):
    > 
    > "P.S. I am using a patched version of blcr to make it work on FC4. The
    > patch was given to me by Paul Hargrove."
    > 
    > The specific version of BLCR in use is:
    > 
    > [jeff@linf1 ~]$ cr_restart --version
    > cr_restart version 0.4.pre1_snapshot_2005_06_27
    > 
    > Sidenote: I notice that cr_checkpoint has a "--version" switch, but it 
    > is not listed in "cr_checkpoint --help" (which was somewhat confusing). 
    >  Ditto for cr_run.
    > 
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
