Problems with BLCR?

From: Jeff Squyres (jsquyres_at_open-mpi.org)
Date: Mon Jul 25 2005 - 06:24:05 PDT

  • Next message: Pradeep Padala: "Re: Problems with BLCR?"
    A user was having problems with LAM + BLCR, so I got a guest account on 
    his cluster and gave it a whirl.  With my own build of LAM/MPI, I'm 
    able to checkpoint just fine (i.e., I get N+1 checkpoint files).  But 
    when I try to restart, I get the following error:
    
    [jeff@linf1 ~]$ cr_restart context.4037
    cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    cri_syscall(CR_OP_RSTRT_REQ, &req): Device or resource busy
    
    What does this mean?
    
    I had checkpointed a simple "hello world" MPI application (4 MPI 
    processes) on a single node.
    
    The user has already been in contact with Paul -- from his initial post 
    on the LAM list 
    (http://www.lam-mpi.org/MailArchives/lam/2005/07/11015.php):
    
    "P.S. I am using a patched version of blcr to make it work on FC4. The
    patch was given to me by Paul Hargrove."
    
    The specific version of BLCR in use is:
    
    [jeff@linf1 ~]$ cr_restart --version
    cr_restart version 0.4.pre1_snapshot_2005_06_27
    
    Sidenote: I notice that cr_checkpoint has a "--version" switch, but it 
    is not listed in "cr_checkpoint --help" (which is somewhat confusing). 
    Ditto for cr_run.
    
    -- 
    {+} Jeff Squyres
    {+} The Open MPI Project
    {+} http://www.open-mpi.org/
    
