Re: trying to integrate OpenMPI+BLCR+SGE

From: Alan Woodland (alan.woodland_at_gmail_dot_com)
Date: Wed Nov 04 2009 - 01:37:29 PST

  • Next message: Josh Hursey: "Re: trying to integrate OpenMPI+BLCR+SGE"
    2009/11/3 Sergio Díaz <[email protected]>
    > I can checkpoint a simple program without SGE (on a single compute node with 2 MPI processes,
    > for instance). Now I'm trying to integrate openmpi+sge, but I have some problems... When
    > I try to checkpoint the mpirun PID, I get an error similar to the one I get when the PID
    > doesn't exist. The example is below.
    
    That error looks like the one you get when the job wasn't started with
    "-am ft-enable-cr" passed to mpirun. Given that the output you pasted
    shows "-am ft-enable-cr" was present, this would lead me to suspect
    that something went wrong during the startup of mpirun. Do you have
    logs of std{out,err} from this at all? IIRC, if checkpointing setup
    fails in OpenMPI at startup for some reason, a few messages get printed
    and things just carry on regardless. Is there anything helpful in a
    verbose/debug output too?
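
    If it helps, here is a minimal sketch (in Python) of one way to keep
    mpirun's stdout and stderr in separate files so any startup messages
    can be read afterwards. Only "-am ft-enable-cr" comes from this
    thread; the application path, the log file names and the helper
    names are hypothetical and just for illustration.

```python
import subprocess
from pathlib import Path


def build_mpirun_cmd(app, nprocs=2, extra_args=()):
    """Return an mpirun argv that requests checkpoint/restart support."""
    # "-am ft-enable-cr" is the option discussed in this thread.
    return ["mpirun", "-am", "ft-enable-cr", "-np", str(nprocs), *extra_args, app]


def run_and_capture(cmd, log_dir="."):
    """Run cmd, writing stdout and stderr to separate files; return the exit code."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    # The file names below are arbitrary choices for this sketch.
    with open(log_dir / "mpirun.out", "wb") as out, \
         open(log_dir / "mpirun.err", "wb") as err:
        return subprocess.run(cmd, stdout=out, stderr=err).returncode


if __name__ == "__main__":
    # Hypothetical application path; only prints the command line.
    cmd = build_mpirun_cmd("./my_mpi_app")
    print(" ".join(cmd))
```

    Calling run_and_capture(cmd, "logs") would then leave logs/mpirun.out
    and logs/mpirun.err to look through for any checkpointing messages.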
    
    > Is there a script to do this automatically with SGE? For instance, to checkpoint every X seconds
    > with BLCR and non-MPI jobs, there is a script that I adapted to my case. It is launched by SGE if
    > you have configured the queue and the ckpt environment.
    
    I've never used SGE, only Condor, and I've never done MPI+BLCR+Condor,
    so I can't really help there, I'm afraid. Is it possible SGE is making
    MPI use a transport other than sm, tcp or self? I'm not sure if the
    checkpointing code works with other transports.
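
    One way to rule that out might be to restrict the transports
    explicitly with "--mca btl". A small Python sketch of building such
    a command line follows; the helper name and the application path
    are hypothetical, and I haven't tested this under SGE.

```python
def with_btl_restriction(mpirun_args, btls=("sm", "tcp", "self")):
    """Insert '--mca btl <list>' right after 'mpirun' in an argv list."""
    if not mpirun_args or mpirun_args[0] != "mpirun":
        raise ValueError("expected an argv list starting with 'mpirun'")
    # Join the transport names with commas, as the btl parameter expects.
    return [mpirun_args[0], "--mca", "btl", ",".join(btls), *mpirun_args[1:]]


if __name__ == "__main__":
    # "./my_mpi_app" is a placeholder application path.
    cmd = with_btl_restriction(
        ["mpirun", "-am", "ft-enable-cr", "-np", "2", "./my_mpi_app"]
    )
    print(" ".join(cmd))
    # prints: mpirun --mca btl sm,tcp,self -am ft-enable-cr -np 2 ./my_mpi_app
```

    If checkpointing works with the transports pinned like this but not
    otherwise, that would point at the transport selection.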
    
    > Is it possible to choose the name of the ckpt folder when you run ompi-checkpoint? I can't find the
    > option to do it.
    
    I think mpirun --tmpdir might help with this one?
    
    [snip]
    
    Alan
    
