Re: restart failed: Device or resource busy, found pid 4818 in use

From: fengguang tian (fernyabc_at_gmail_dot_com)
Date: Thu Mar 25 2010 - 21:15:29 PDT

  • Next message: Tao Ke: "Re: question about "cr_save_mmaps_data" function"
    Hi, Paul
    So is that a bug? What can I do to avoid this error (the "bad state") and
    make the checkpoint succeed?
    
    cheers
    fengguang
    
    On Fri, Mar 26, 2010 at 12:07 AM, Paul H. Hargrove <PHHargrove_at_lbl_dot_gov> wrote:
    
    > fengguang,
    >
    > I believe that most MPI implementations will TRY to "do the right thing"
    > when signaled with SIGTERM or SIGINT (SIGTERM is the default for the kill
    > command).  However, they cannot always do so if things are in a bad state,
    > such as hung processes.  They also cannot do so if you send SIGKILL, which
    > gives mpirun no opportunity to kill the application processes.
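    > [As a minimal sketch of the difference described above, a background
    > 'sleep' stands in for the mpirun process; SIGTERM can be caught and
    > handled, while SIGKILL cannot:]

```shell
# Illustrative only: a background 'sleep' stands in for the mpirun process.
sleep 300 &
pid=$!

# SIGTERM is the default for 'kill': the target can catch it and clean up
# (mpirun uses this opportunity to terminate its application processes).
kill -TERM "$pid"
wait "$pid" 2>/dev/null

# kill -0 delivers no signal; it only tests whether the PID still exists.
kill -0 "$pid" 2>/dev/null || echo "process gone"   # prints: process gone

# By contrast, 'kill -KILL' (SIGKILL) cannot be caught, so mpirun would get
# no chance to shut down its children before dying.
```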
    >
    > In your case you indicated in a previous email that ompi-checkpoint hangs
    > for you.  That is probably a good indication that the MPI job is in the
    > sort of "bad state" I warned about above.  So, you might need to manually
    > kill the MPI application processes on all the nodes now.  It is possible
    > that Open MPI may include a command to assist in that, but if so I don't
    > know what it is.
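    > [One way to do the manual cleanup suggested above is a loop over the
    > nodes that removes leftover application processes by name.  The host
    > list and application name below are hypothetical; it is shown as a dry
    > run that only prints the commands -- drop the 'echo' to execute them:]

```shell
# Hypothetical host list and application name -- adjust for your cluster.
nodes="node01 node02 node03"
app="my_mpi_app"

for n in $nodes; do
    # On each node, kill leftover application processes matching the name.
    # 'echo' makes this a dry run; remove it to actually run the commands.
    echo ssh "$n" "pkill -f $app"
done
```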
    >
    > -Paul
    >
    >
    > fengguang tian wrote:
    >
    >> I have killed the original MPI job manually on the master node using the
    >> kill command, and then restarted the job, so it couldn't be that reason.
    >>
    >> Or do I need to kill the processes on both the master and the slave nodes?
    >>
    >> cheers
    >> fengguang
    >>
    >> On Thu, Mar 25, 2010 at 8:53 PM, Paul H. Hargrove
    >> <PHHargrove_at_lbl_dot_gov> wrote:
    >>
    >>    The message says that there are some pids (process IDs) in use
    >>    (allocated to running processes) that are needed for the restart.
    >>    This typically happens if one tries to restart when the original
    >>    run has not yet exited, for instance if there are portions of it hung.
    >>
    >>    With very large clusters it becomes a statistically significant
    >>    possibility that one could have a few random collisions with other
    >>    processes on the nodes.  However, given the number and grouping of
    >>    the pids, I strongly suspect the original MPI job is still running
    >>    or is hung.
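    >>    [A quick way to confirm this suspicion is to test whether the PIDs
    >>    reported by the restart error still belong to live processes, e.g.
    >>    with 'kill -0', which delivers no signal and only checks existence.
    >>    The sketch below uses the shell's own PID as a stand-in for one of
    >>    the reported pids such as 4818:]

```shell
# kill -0 sends no signal; it merely tests whether the PID is allocated.
# Our own shell's PID ($$) is certainly in use, so this reports a match.
pid=$$
if kill -0 "$pid" 2>/dev/null; then
    echo "pid $pid in use"    # a restart needing this PID would fail
else
    echo "pid $pid free"
fi
```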
    >>
    >>    -Paul
    >>
    >>
    >>    fengguang tian wrote:
    >>
    >>        Hi
    >>
    >>        when I use ompi-restart to restart from the checkpoint file on a
    >>        cluster (using Open MPI), an error occurs:
    >>        - found pid 4813 in use
    >>        - found pid 4824 in use
    >>        - found pid 4827 in use
    >>        Restart failed: Device or resource busy
    >>        - found pid 4812 in use
    >>        - found pid 4822 in use
    >>        - found pid 4823 in use
    >>        Restart failed: Device or resource busy
    >>        - found pid 4815 in use
    >>        - found pid 4828 in use
    >>        - found pid 4829 in use
    >>        Restart failed: Device or resource busy
    >>        - found pid 4818 in use
    >>        - found pid 4819 in use
    >>        Restart failed: Device or resource busy
    >>        - found pid 4814 in use
    >>        - found pid 4825 in use
    >>        - found pid 4826 in use
    >>        Restart failed: Device or resource busy
    >>
    >>
    >>        Why would this happen?
    >>
    >>        cheers
    >>        fengguang
    >>
    >>
    >>
    >>    --
    >>    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    >>
    >>    Future Technologies Group                 Tel: +1-510-495-2352
    >>    HPC Research Department                   Fax: +1-510-486-6900
    >>    Lawrence Berkeley National Laboratory
    >>
    >>
    >
    > --
    > Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    > Future Technologies Group                 Tel: +1-510-495-2352
    > HPC Research Department                   Fax: +1-510-486-6900
    > Lawrence Berkeley National Laboratory
    >
    
