Re: lam/mpi blcr problem

From: 任明明 (0110018_at_mail.nankai.edu.cn)
Date: Wed Mar 23 2005 - 06:27:18 PST

  • Next message: Jeff Squyres: "Re: lam/mpi blcr problem"
    Thank you for your help!
    I can use BLCR to checkpoint non-MPI programs, such as the examples
    included with the BLCR software, and this works on all the nodes.
    But when I use cr_checkpoint to checkpoint an MPI program, it doesn't
    generate a context file for each process; it only generates a context
    file for the mpirun command.
    
    Here is everything I did:
    
    In one window:
    ****************************************************
    [rmingming@node01 lam]$ mpicc cpi.c -o cpi
    [rmingming@node01 lam]$ lamboot -v nodes
    
    LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
    
    n-1<8238> ssi:boot:base:linear: booting n0 (node01)
    n-1<8238> ssi:boot:base:linear: booting n1 (node02)
    n-1<8238> ssi:boot:base:linear: booting n2 (node03)
    n-1<8238> ssi:boot:base:linear: booting n3 (node04)
    n-1<8238> ssi:boot:base:linear: finished
    [rmingming@node01 lam]$ mpirun C -ssi rpi crtcp -ssi cr blcr ./cpi
    Process 0 on node01
    Process 1 on node02
    Process 3 on node04
    Process 2 on node03
    Enter the number of intervals: (0 quits) 0 (--- I ran cr_checkpoint while this prompt was waiting)
    [rmingming@node01 lam]$
    
    ******************************************************
    
    In another window:
    
    ******************************************************
    
    [rmingming@node01 lam]$ cr_checkpoint 8248
    [rmingming@node01 lam]$ ls
    context.8248  cpi  cpi.c  hello.c  nodes  ring
    (I can't find the context files for each process; I also checked the home directory.)
    [rmingming@node01 lam]$ cr_restart context.8248
    mpirun (rpwait): Bad file descriptor
    [rmingming@node01 lam]$
    
    ******************************************************
    
    I hope to hear from you all :)
    
    In your message, you wrote:
    >From: Jeff Squyres <[email protected]>
    >Reply-To: 
    >To: checkpoint_at_lbl_dot_gov
    >Subject: Re: lam/mpi blcr problem
    >Date:Tue, 22 Mar 2005 15:23:46 -0500
    >
    >On Mar 22, 2005, at 12:05 PM, Paul H. Hargrove wrote:
    > 
    > > I am sorry to hear that you are having problems.  Let's see if we can 
    > > help.
    > >
    > > As far as I can tell your LAM configuration is OK, but I am cc:ing 
    > > this to one of the LAM developers who may be able to spot something I 
    > > could not.
    > 
    > No need -- I'm actually on the checkpoint_at_lbl_dot_gov list.  :-)
    > 
    > > Have you tried 'make check' in the blcr build directory or 
    > > checkpointing/restarting some of the non-mpi examples in blcr's 
    > > examples directory?  It would be good to know that the blcr build was 
    > > OK before bringing LAM into the mix.
    > >
    > > When LAM ran the mpi application, was blcr installed (and the kernel 
    > > modules loaded) on all the compute nodes running the mpi job?
    > 
    > Additionally, were you using the crtcp RPI?  I.e., what was the 
    > specific command that you used to mpirun your application?  And how did 
    > you try to checkpoint it?
    > 
    > -- 
    > {+} Jeff Squyres
    > {+} [email protected]
    > {+} http://www.lam-mpi.org/
    > 
    >
    
