Re: restart failed: Device or resource busy, found pid 4818 in use

From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Thu Mar 25 2010 - 18:23:13 PDT

  • Next message: TK: "Re: question about "cr_save_mmaps_data" function"
    I killed the original MPI job manually on the master node using the kill
    command and then restarted the job, so that shouldn't be the reason.
    
    Or do I need to kill the processes on both the master and the slave nodes?
    
    cheers
    fengguang
    
    On Thu, Mar 25, 2010 at 8:53 PM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    
    > The message says that there are some pids (process IDs) in use (allocated
    > to running processes) that are needed for the restart.
    > This typically happens if one tries to restart when the original run has
    > not yet exited, for instance if there are portions of it hung.
    >
    > With very large clusters it becomes a statistically significant possibility
    > that one could have a few random collisions with other processes on the
    > nodes.
    > However, given the number and grouping of the pids, I strongly suspect the
    > original MPI job is still running or is hung.
    >
    > -Paul
    >
    >
    > fengguang tian wrote:
    >
    >> Hi
    >>
    >> When I use ompi-restart to restart from the checkpoint file on a cluster
    >> (using Open MPI), an error occurs. It shows:
    >> - found pid 4813 in use
    >> - found pid 4824 in use
    >> - found pid 4827 in use
    >> Restart failed: Device or resource busy
    >> - found pid 4812 in use
    >> - found pid 4822 in use
    >> - found pid 4823 in use
    >> Restart failed: Device or resource busy
    >> - found pid 4815 in use
    >> - found pid 4828 in use
    >> - found pid 4829 in use
    >> Restart failed: Device or resource busy
    >> - found pid 4818 in use
    >> - found pid 4819 in use
    >> Restart failed: Device or resource busy
    >> - found pid 4814 in use
    >> - found pid 4825 in use
    >> - found pid 4826 in use
    >> Restart failed: Device or resource busy
    >>
    >>
    >> Why would this happen?
    >>
    >> cheers
    >> fengguang
    >>
    >
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
    >
    
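One way to test the explanation quoted above is to check, on each node, whether the pids named in the error output are still allocated to running processes. The following is only a minimal sketch, assuming a POSIX system and Python 3. The helper names `pid_in_use` and `pids_still_in_use` are hypothetical. The pid list is copied from the error output in the original message, which does not say which node each pid belongs to. Pids are per-node, so the check has to be run on the master and on every slave node.

```python
import os


def pid_in_use(pid: int) -> bool:
    """Return True if `pid` is allocated to a process on this node (hypothetical helper)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        raise ValueError("pid must be a positive integer")
    try:
        # Signal 0 performs only the existence/permission check; nothing is delivered.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the pid is free
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True


def pids_still_in_use(pids):
    """Return the subset of `pids` that are currently in use on this node."""
    return [pid for pid in pids if pid_in_use(pid)]


if __name__ == "__main__":
    # Pids taken from the "found pid ... in use" lines in the original message.
    reported = [4812, 4813, 4814, 4815, 4818, 4819, 4822, 4823,
                4824, 4825, 4826, 4827, 4828, 4829]
    busy = pids_still_in_use(reported)
    if busy:
        print("still in use on this node:", busy)
    else:
        print("none of the reported pids are in use on this node")
```

If any pids are reported as still in use on a node, the leftover processes from the original run there are the likely cause, and the restart can be retried once they have exited.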
