Re: More testing result about "Error in exec". Re: Error in exec

From: Eric Roman (ERoman_at_lbl_dot_gov)
Date: Fri May 21 2004 - 16:33:18 PDT

  • Next message: jcduell_at_lbl_dot_gov: "Re: Fw: BLCR checkpoint sizes"
    I'd like to give a shot at strace'ing this thing, to see what happens.
    
    AFAIK, there's not a way to restart with ptrace attached.  Best thing to do is cheat
    and try something like (under zsh):
    
    (while strace -p 344 ; do ; : ; done ) |& grep -v "No such process"
    
    And then in another window type:
    cr_restart context.344
    
    Then kill the strace loop above after you get the "Error in exec"
    message.
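
A hedged sketch of the same retry idea in portable sh, for anyone not running zsh. It assumes strace exits with a non-zero status when it cannot attach to the PID, so the loop is written with `until` (retry on failure) rather than `while`. The helper name `retry_until_success` is hypothetical and is not part of strace, BLCR, or LAM.

```shell
#!/bin/sh
# Hypothetical helper: rerun the given command until it exits with status 0.
# It busy-loops on purpose so that the command is retried as quickly as
# possible, which matters when trying to catch a process right after restart.
retry_until_success() {
    until "$@"; do
        :
    done
}

# Intended use (assumption: strace fails with "No such process" until the
# restarted process with PID 344 exists, then attaches and traces it):
#
#   retry_until_success strace -p 344 2>&1 | grep -v "No such process"
#
# "2>&1 |" is the portable spelling of the "|&" used in the zsh one-liner.
```

As with the one-liner, the loop would be started first, then cr_restart run in another window, and the loop interrupted by hand once the error message appears.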
    
    Let's see what the strace has to say...
    
     - E
    
    
    On Fri, May 21, 2004 at 11:47:11AM -0500, Kevin wrote:
    > I tested BLCR with LAM further. It seems the problem is caused by the checkpoint file in which mpirun is saved. For example, if I use
    > 
    > mpirun -np 1 ./hello, and assume the pid of mpirun is 344,
    > 
    > then two context files are created: context.344, in which the mpirun process information is saved, and context.344-n0-345, in which the single "hello" process information is saved. I can use cr_restart to restart a process from context.344-n0-345 with partial success (in fact, the restarted process doesn't exit on its own; it just gets stuck after execution);
    > but if I use
    > cr_restart context.344
    > then that's where "Error in exec" happens.  Is it true that we can't restart a set of processes that belong to an MPI program at the same time? I would expect the file context.344 to hold enough information to let an MPI program with multiple processes restart together, rather than restarting the individual processes one by one as I did.
    > 
    > 
    > 
    > ----- Original Message ----- 
    > From: "Kevin" <[email protected]>
    > To: <eroman_at_lbl_dot_gov>
    > Cc: <checkpoint_at_lbl_dot_gov>
    > Sent: Thursday, May 20, 2004 10:12 AM
    > Subject: Re: Error in exec
    > 
    > 
    > > Eric,
    > > 
    > > Thanks for your suggestion. I checked my PATH setting, and it does include the path to mpirun, which is in the LAM/bin directory. If the problem comes from crtcp, is there some way to solve it?
    > > 
    > > Kevin
    > > 
    > > 
    > >  
    > > ----- Original Message ----- 
    > > From: "Eric Roman" <ERoman_at_lbl_dot_gov>
    > > To: "Kevin" <[email protected]>
    > > Cc: <checkpoint_at_lbl_dot_gov>
    > > Sent: Wednesday, May 19, 2004 11:48 AM
    > > Subject: Re: Error in exec
    > > 
    > > 
    > > > 
    > > > Kevin
    > > > 
    > > > Best I can tell, this is an error coming from LAM.  It looks like the "Error
    > > > in exec" message is produced by crtcp when it fails to exec a new mpirun.
    > > > 
    > > > Most likely reason for exec() to fail is that the executable wasn't found.
    > > > I'd check the path that the MPI app is using.  Make sure it includes mpirun.
    > > > 
    > > >  - E
    > > > 
    > > > On Wed, May 19, 2004 at 10:07:21AM -0500, Kevin wrote:
    > > > > Dear Sir, 
    > > > > 
    > > > > I am using lam7.0.4 combined with blcr-0.2.0 to checkpoint MPI programs. It worked fine before with a single program and with an MPI program running on one node. Today, when I tried to checkpoint an MPI program (the "hello" program from the examples directory of the LAM package) running on one node of our cluster, the program could be checkpointed and the context file was saved. But when I try to restart it, it prints "Error in exec" to the screen. I can't figure out where the problem is. Could you please give me some suggestions?
    > > > > 
    > > > > Below is some information on my operation and configuration:
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src]mpirun C ./hello 
    > > > > // it works fine and the output is displayed on console 1
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src] getpid mpirun 
    > > > > // I got the pid of mpirun with a script "getpid" on console 2; assume it is 344
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src]cr_checkpoint 344
    > > > > // checkpointing ./hello from console 2 works fine; context.344 is saved to disk
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src]cr_restart context.344
    > > > > Error in exec
    > > > > 
    > > > > ---below are configurations----------------------------------
    > > > > [kevin@Sparrow-01-02 ~/src]lamnodes
    > > > > n0      Sparrow-01-02.ERC.MsState.Edu:1:origin,this_node
    > > > > 
    > > > > [kevin@Sparrow-01-02 ~/src]laminfo
    > > > >            LAM/MPI: 7.0.4
    > > > >             Prefix: /home/kevin/LAM
    > > > >       Architecture: i686-pc-linux-gnu
    > > > >      Configured by: kevin
    > > > >      Configured on: Mon May  3 15:45:08 CDT 2004
    > > > >     Configure host: Sparrow-01-01.ERC.MsState.Edu
    > > > >         C bindings: yes
    > > > >       C++ bindings: yes
    > > > >   Fortran bindings: yes
    > > > >        C profiling: yes
    > > > >      C++ profiling: yes
    > > > >  Fortran profiling: yes
    > > > >      ROMIO support: yes
    > > > >       IMPI support: no
    > > > >      Debug support: no
    > > > >       Purify clean: no
    > > > >           SSI boot: globus (Module v0.5)
    > > > >           SSI boot: rsh (Module v1.0)
    > > > >           SSI coll: lam_basic (Module v7.0)
    > > > >           SSI coll: smp (Module v1.0)
    > > > >            SSI rpi: crtcp (Module v1.0.1)
    > > > >            SSI rpi: lamd (Module v7.0)
    > > > >            SSI rpi: sysv (Module v7.0)
    > > > >            SSI rpi: tcp (Module v7.0)
    > > > >            SSI rpi: usysv (Module v7.0)
    > > > >             SSI cr: blcr (Module v1.0.1)
    > > > > 
    > > > > 
    > > > >  
    > > > 
    > > > -- 
    > > > Eric Roman                       Computational Research Division
    > > > 510-486-6420                     Berkeley Lab
    > > > 
    > > 
    
    -- 
    Eric Roman                       Computational Research Division
    510-486-6420                     Berkeley Lab
    
