Re: restart failed:Device or resource busy,found pid 4818 in use

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Thu Mar 25 2010 - 21:07:11 PDT

  • Next message: fengguang tian: "Re: restart failed:Device or resource busy,found pid 4818 in use"
    fengguang,
    
    I believe that most MPI implementations will TRY to "do the right thing"
    if signaled with a SIGTERM or SIGINT (SIGTERM is the default for the
    kill command).  However, mpirun cannot always do so if things are in a
    bad state, such as when processes are hung.  It also cannot do so if you
    send SIGKILL, which does not allow mpirun any opportunity to kill the
    application processes.
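
To illustrate the difference between the signals, here is a minimal Python sketch of a launcher that cleans up its children when it receives SIGTERM or SIGINT. It is not Open MPI's actual code; the names spawn_workers, shut_down, install_cleanup, and can_catch are hypothetical, the workers are plain sleeping processes standing in for application processes, and a POSIX system is assumed.

```python
import signal
import subprocess
import sys


def spawn_workers(count):
    """Start `count` idle child processes (hypothetical stand-ins for MPI ranks)."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(count)]


def shut_down(workers, timeout=5.0):
    """Ask every worker to exit with SIGTERM, escalating to SIGKILL if one hangs."""
    for worker in workers:
        worker.terminate()
    for worker in workers:
        try:
            worker.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            worker.kill()
            worker.wait()


def install_cleanup(workers):
    """Make SIGTERM and SIGINT run shut_down() before the launcher exits."""
    def handler(signum, frame):
        shut_down(workers)
        sys.exit(128 + signum)

    for signum in (signal.SIGTERM, signal.SIGINT):
        signal.signal(signum, handler)


def can_catch(signum):
    """Return True if this process is allowed to install a handler for signum."""
    try:
        # Temporarily install a no-op handler; the OS rejects this for SIGKILL.
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        return False
    if previous is not None:
        signal.signal(signum, previous)  # put the earlier handler back
    return True
```

On a POSIX system, can_catch(signal.SIGTERM) succeeds while can_catch(signal.SIGKILL) does not, which is why a launcher killed with SIGKILL never gets to run anything like shut_down() and its children are left behind.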
    
    In your case you indicated in a previous email that ompi-checkpoint hangs
    for you.  That is probably a good indication that the MPI job is in the
    sort of "bad state" I warned about above.  So, you might need to
    manually kill the MPI application processes on all the nodes now.  It is
    possible that Open MPI may include a command to assist in that, but if
    so I don't know what it is.
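
Before killing anything by hand, it can help to confirm which of the reported pids still exist on a given node. The following is a rough Python sketch under stated assumptions: the helper names are hypothetical, the regular expression is based only on the "found pid N in use" lines quoted in this thread, and the check covers only the node it runs on, so it would have to be run on each node separately.

```python
import os
import re

# Matches lines such as "- found pid 4813 in use" from the restart output.
PID_LINE = re.compile(r"found pid (\d+) in use")


def pids_reported_in_use(restart_output):
    """Return the sorted, de-duplicated pids named in the restart output."""
    return sorted({int(m.group(1)) for m in PID_LINE.finditer(restart_output)})


def pid_in_use(pid):
    """Return True if a process with this pid exists on the local node."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the pid
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def still_in_use(restart_output):
    """Return the reported pids that are still allocated on this node."""
    return [pid for pid in pids_reported_in_use(restart_output) if pid_in_use(pid)]
```

Any pid that still_in_use() returns belongs to some live process on that node. Whether it is a leftover from the original MPI job or an unrelated process should be checked (for example with ps) before sending it a signal.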
    
    -Paul
    
    
    fengguang tian wrote:
    > I killed the original MPI job manually on the master node using the
    > kill command and then restarted the job, so that couldn't be the reason.
    >
    > Or do I need to kill the processes on both the master and slave nodes?
    >
    > cheers
    > fengguang
    >
    > On Thu, Mar 25, 2010 at 8:53 PM, Paul H. Hargrove
    > <PHHargrove_at_lbl_dot_gov> wrote:
    >
    >     The message says that there are some pids (process IDs) in use
    >     (allocated to running processes) that are needed for the restart.
    >     This typically happens if one tries to restart when the original
    >     run has not yet exited, for instance if there are portions of it hung.
    >
    >     With very large clusters it becomes a statistically significant
    >     possibility that one could have a few random collisions with other
    >     processes on the nodes.
    >     However, given the number and grouping of the pids, I strongly
    >     suspect the original MPI job is still running or is hung.
    >
    >     -Paul
    >
    >
    >     fengguang tian wrote:
    >
    >         Hi
    >
    >         When I use ompi-restart to restart from the checkpoint file on a
    >         cluster (using Open MPI), an error occurs. It shows:
    >         - found pid 4813 in use
    >         - found pid 4824 in use
    >         - found pid 4827 in use
    >         Restart failed: Device or resource busy
    >         - found pid 4812 in use
    >         - found pid 4822 in use
    >         - found pid 4823 in use
    >         Restart failed: Device or resource busy
    >         - found pid 4815 in use
    >         - found pid 4828 in use
    >         - found pid 4829 in use
    >         Restart failed: Device or resource busy
    >         - found pid 4818 in use
    >         - found pid 4819 in use
    >         Restart failed: Device or resource busy
    >         - found pid 4814 in use
    >         - found pid 4825 in use
    >         - found pid 4826 in use
    >         Restart failed: Device or resource busy
    >
    >
    >         Why would this happen?
    >
    >         cheers
    >         fengguang
    >
    
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group                 Tel: +1-510-495-2352
    HPC Research Department                   Fax: +1-510-486-6900
    Lawrence Berkeley National Laboratory     
    
