Re: checkpointing processes with >2GB on x86_64

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Wed Apr 18 2007 - 14:26:07 PDT

    Thomas Zeiser wrote:
    > Dear All,
    > 
    > is there a 2 GB process limit for checkpointing on x86_64??
    
    There is not any intentional limit or technical limitation.  It is
    possible that you've encountered a BLCR bug.
    
    > 
    > On our system with
    > - SuSE SLES9sp3 x86_64 (kernel contains in addition Voltaire
    >   Infiniband and Intel VTune modules)
    > - blcr-0.5.3 built from source rpm
    > - socket nodes with Intel Xeon 5100 ("Woodcrest") CPUs
    > - I'm doing the tests from /tmp (formatted with reiserfs) using
    >   cr_run
    > 
    > I observe the following:
    > - checkpointing and restarting a process with <2GB total size works
    >   fine ("simple" sequential Fortran code compiled with Intel 9.1 EM64T
    >   compilers, no sockets etc. open, just a few plain files)
    >   => no problems at all.
    > 
    > however, if I increase the working set to >2GB memory footprint
    > (i.e. same executable as memory is allocated dynamically)
    > - when calling "cr_checkpoint --term PID" the system often starts
    >   to swap  (e.g. for 5 GB working set on a system with 8 GB RAM)
    
    The swapping is "normal" for application working sets larger than about
    1/2 of physical memory, as the dump process will end up creating I/O
    buffers of equal volume.  We hope to work around that in the future.
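
As a rough illustration of that rule of thumb, the sketch below assumes the
I/O buffers grow to about the size of the working set, so peak demand is
about twice the working set. The factor of 2 and the function names are
assumptions made for illustration, not BLCR internals.

```python
def estimated_peak_gb(working_set_gb: float) -> float:
    """Rough peak memory demand during a dump.

    Assumption: I/O buffers grow to roughly the size of the working set,
    so the peak is about twice the working set.
    """
    return 2.0 * working_set_gb


def likely_to_swap(working_set_gb: float, ram_gb: float) -> bool:
    """True when the estimated peak exceeds physical memory."""
    return estimated_peak_gb(working_set_gb) > ram_gb


# The case from this thread: 5 GB working set on an 8 GB machine.
print(likely_to_swap(5, 8))  # True: about 10 GB needed, 8 GB available
print(likely_to_swap(3, 8))  # False: about 6 GB needed
```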
    
    > - it takes quite a long time and suddenly cr_checkpoint disappears
    >   (with exit code 5 if I've seen it correctly) but no context.### 
    >   file has been written
    
    The long time is probably the swapping.  No file is written because
    cr_checkpoint is writing to a temporary file that is renamed on success,
    but unlinked on error.  There is currently no way to keep the file on error.
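
The general write-to-temporary-then-rename pattern can be sketched as
follows. This is only an illustration of the pattern, not cr_checkpoint's
actual code; the function name and the ".tmp-" prefix are hypothetical.

```python
import os
import tempfile


def write_atomically(final_path, chunks):
    """Write chunks to a temporary file in the target directory.

    On success the temporary file is renamed to final_path; on any error
    it is unlinked, so no partial file is left behind.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Rename only after all data has been written successfully.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Error path: remove the temporary file, then re-raise.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this pattern a failed write (for example one that hits EIO) leaves
nothing under the final name, which matches the behavior described above.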
    
    The exit code 5 corresponds to errno=EIO, consistent w/ the message on
    STDERR.
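
A quick way to map such an exit code back to an errno name and message is
sketched below. It assumes Linux errno numbering, and the helper name is
hypothetical.

```python
import errno
import os


def describe_exit_code(code):
    """Return (symbolic errno name, message) for an errno-style exit code."""
    name = errno.errorcode.get(code, "unknown")
    return name, os.strerror(code)


# Exit code 5 as reported for cr_checkpoint in this thread.
name, message = describe_exit_code(5)
print(name, "-", message)
```

On Linux this reports EIO, with the same "Input/output error" text seen in
the ioctl() message.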
    
    > - on STDERR I see
    > ioctl(/proc/checkpoint/ctrl, CR_OP_CHKPT_REAP): Input/output error
    > - there are no further messages in dmesg or syslog
    > - and the application continues to run (despite --term, but that
    >   might be fine as no context file is written)
    >   => no restart for >2GB although OS and application are 64-bit !?
    
    The lack of a context file *is* why the app continues to run.
    
    > Any ideas? Did I miss something?
    
    The first thing that comes to mind is to check for rlimit problems.  Run
    "ulimit -a" in a Bourne-type shell, or "limit" in a C-shell.  Check
    the "filesize" limit to see if it is anything other than "unlimited".
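
The same check can be made programmatically. The sketch below uses Python's
standard resource module on a Unix system; the helper names are
hypothetical. Note that getrlimit() reports bytes, whereas shells typically
print this limit in blocks.

```python
import resource


def file_size_limits():
    """Return the (soft, hard) RLIMIT_FSIZE values for this process."""
    return resource.getrlimit(resource.RLIMIT_FSIZE)


def describe(limit):
    """Render a limit the way the shell does for the unlimited case."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"


def would_fit(size_bytes):
    """True if a file of size_bytes is allowed by the soft limit."""
    soft, _hard = file_size_limits()
    return soft == resource.RLIM_INFINITY or size_bytes <= soft


soft, hard = file_size_limits()
print("filesize soft limit:", describe(soft))
print("filesize hard limit:", describe(hard))
# A context file for a 5 GB working set would need roughly this much room.
print("5 GB file allowed:", would_fit(5 * 1024**3))
```

Run it from the same shell or batch environment that launches
cr_checkpoint, since limits are inherited per process.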
    
    > 
    > 
    > Regards,
    > 
    > thomas
    
    -Paul
    
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
