Re: BLCR Error

From: Eric Roman (ESRoman_at_berkeley_dot_edu)
Date: Thu Sep 25 2008 - 17:27:18 PDT

  • Next message: Rooster Boy: "Invitation from Rooster Boy"
    Dear Jin,
    
    I'm not sure exactly what's wrong, but based on your error, it sounds
    like the shell script wasn't started with cr_run.  That's where you
    would get the 'Checkpoint failed: support missing from application'.
    
    The first thing I would try is writing a wrapper around cr_run that
    would write messages to the system log.
    
    Something like this:
    
    Change the MOM configuration from
    $checkpoint_run_exe /usr/local/bin/cr_run
    to
    $checkpoint_run_exe /usr/local/bin/cr_run.logging
    
    and create a script /usr/local/bin/cr_run.logging:
    
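    #!/bin/bash
    # Log the wrapper name and its arguments to syslog, then run cr_run.
    logger "$0: $*"
    /usr/local/bin/cr_run "$@"
    # Pass cr_run's exit status back to MOM ($? is the status; $! is a PID).
    exit $?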
    
    You'll see messages in the system log every time a checkpointable job
    starts.  If you don't see any messages, then for some reason MOM isn't
    invoking cr_run.
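
    A quick way to check is to count the wrapper's lines in the log.  This
    is only a sketch: count_wrapper_msgs is a made-up helper name, and it
    assumes the wrapper is installed as cr_run.logging so that name appears
    in each message.  Where syslog writes its file depends on your system.

```shell
#!/bin/sh
# Count syslog lines written by the logging wrapper.
# Assumption: the wrapper is named cr_run.logging, so the "$0" prefix that
# logger writes puts that name in every message.
count_wrapper_msgs() {
    # $1: path to a syslog file (the location varies between systems)
    grep -c 'cr_run\.logging' "$1"
}
```

    For example, "count_wrapper_msgs /var/log/messages" after submitting a
    checkpointable job.  A count of 0 points at MOM not invoking the wrapper.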
    
    If you do see messages, it means that (for some reason) the checkpoint
    library isn't being preloaded into your application.  This could happen
    for a few reasons -- a few obvious ones being that the checkpoint
    library wasn't found (unlikely), one of the processes is statically
    linked, or the LD_PRELOAD set by cr_run is being lost somewhere in the
    environment.  If that's the case, try checkpointing something else.
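
    To narrow down which of these it is, you can inspect the running job
    directly.  A rough sketch for Linux, using /proc; proc_env_var and
    is_static are hypothetical helper names, and you need permission to
    read the job's /proc entries (run it as the job owner or root).

```shell
#!/bin/sh
# Print NAME=value from the environment a process was started with.
# Linux-specific: /proc/PID/environ is a NUL-separated list.
proc_env_var() {
    # $1: process ID, $2: variable name
    tr '\0' '\n' < "/proc/$1/environ" | grep "^$2="
}

# Succeed if the executable behind a PID is statically linked.
# Relies on file(1) reporting "statically linked" for static ELF binaries;
# other wordings (e.g. static PIE) would not match this pattern.
is_static() {
    # $1: process ID
    file -L "/proc/$1/exe" | grep -q 'statically linked'
}
```

    For the job in your log, "proc_env_var 28676 LD_PRELOAD" printing
    nothing would suggest the preload was lost before the job started, and
    "is_static 28676" succeeding would suggest a statically linked binary.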
    
    Eric
    
    
    On Tue, Sep 23, 2008 at 04:29:45PM +1000, Jin Zhang wrote:
    > Dear BLCR,
    > 
    > I've got a problem by using BLCR.
    > I install BLCR in cluster, and tried to run with Torque for a serial job.
    > I've configured Torque with --enable-blcr, I've installed BLCR into kernel with insmod, and I've create the script that mom_priv need.
    > 
    > However, when I run qhold, there was an error message as following:
    > 
    > Sep 23 15:43:00 wayland003 pbs_mom: mach_checkpoint, checkpoint args: /usr/spool/PBS/mom_priv/blcr_checkpoint_script 28676 155.wayland.in.vpac.org wl /usr/spool/PBS/checkpoint ckpt.155.wayland.in.vpac.org.1222148580 15
    > Sep 23 15:43:00 wayland003 checkpoint_script: Invoked: /usr/spool/PBS/mom_priv/blcr_checkpoint_script 28676 155.wayland.in.vpac.org wl /usr/spool/PBS/checkpoint ckpt.155.wayland.in.vpac.org.1222148580 15 
    > Sep 23 15:43:00 wayland003 checkpoint_script: Subcommand (cr_checkpoint --signal 15 --tree 28676 --file ckpt.155.wayland.in.vpac.org.1222148580) failed with rc=16777215: 
    > 
    > Then I check qstat -f 155, Job_state = R, it still running.
    > 
    > When I ran:
    > cr_checkpoint --signal 15 --tree 28676 --file ckpt.155.wayland.in.vpac.org.1222148580,
    > there was another error:
    > Checkpoint failed: support missing from application
    > 
    > Can you please tell me what's the problem
    > 
    > Thanks
    > 
    > -- 
    > Jin Zhang
    > 
    > Systems Administrator
    > Victorian Partnership for Advanced Computing
    > 110 Victoria St. Carlton South, VIC, 3053 AU
    > E: jin_at_vpac_dot_org    P: +61 (03) 9925 4942 
    
    -- 
    Eric Roman                       Department of Physics
    510-642-7302                     UC Berkeley
    
