Announcing the release of BLCR 0.7.0

From: Paul H. Hargrove (PHHargrove_at_lbl_dot_gov)
Date: Fri May 30 2008 - 17:58:54 PDT

    I am pleased to announce the release of BLCR 0.7.0.
    
    The 0.7.0 release is now available from the BLCR Downloads page:
    http://ftg.lbl.gov/CheckpointRestart/CheckpointDownloads.shtml
    
    Relative to the 0.6.x series, this release includes many new features
    and some improvements in stability.  This release contains all of the
    new development in BLCR since the 0.6.0 release in September 2007 (0.6.x
    for x>0 were primarily fixes for bugs or addition of support for newer
    kernels).
    
    Many features have been added to the cr_checkpoint and cr_restart
    utilities, mostly to aid use of BLCR in a batch environment.  This
    release also adds experimental support for PPC32 kernels, and for
    Xen-enabled (paravirtualized) kernels.
    
    A summary of the user-visible changes in BLCR, relative to 0.6.5,
    appears below in the form of an excerpt from the NEWS file.
    
    -Paul
    
    PS
    You are receiving this either because you are on the checkpoint_at_lbl_dot_gov
    list, because you've recently sent email to the list (or me directly)
    asking about BLCR status, or because our Bugzilla shows your interest
    in a bug fixed in this beta.
    
    
    NEWS:
    0.7.0
    --------
    May 30, 2008
    Enhanced functionality and expanded-support release.
      - This release adds support for 2.6.25 kernels.
      - This release adds EXPERIMENTAL support for 32-bit PPC platforms.
      - This release adds EXPERIMENTAL support for Xen-enabled kernels (both
        dom0 and domU).  In our testing so far it either works great or not
        at all, depending on the installation.  We have not yet identified
        what distinguishes the working systems from the broken ones, and
        would appreciate feedback on your success or failure using BLCR
        with Xen-enabled Linux kernels.
      - As previously announced, this release removes support for 2.4.x kernels
        with the current exception of the RH9 and RHEL kernels (and derivatives)
        that contain backported NPTL support.  These too may be removed in the
        not too distant future.
      - As previously announced, this release begins the removal of support for
        LinuxThreads.  If you are using LinuxThreads you may experience random
        failures in the BLCR testsuite and with your own multithreaded apps.
        New interest in using BLCR with non-GNU libc may lead to a return of
        LinuxThreads support.  Please contact us if you have interest in this.
      - This release adds the following features to the cr_checkpoint utility:
         + --quiet option to suppress output from cr_checkpoint
         + --noclobber option (don't disturb existing files)
         + Options for treatment of ptrace() child and parent processes:
            --ptraced-{error,skip,allow}
            --ptracer-{error,skip}
         + Options to save/restore executables and libraries in context files:
            --save-{exe,private,shared,all,none}
        See the cr_checkpoint manpage for more details on each of these.
      - This release adds the following features to the cr_restart utility:
         + --quiet option to suppress output from cr_restart
           Previously there was no way to do so without also losing the output
           of the restarted process(es).
         + Add --run-on-* family of options to provide user-specified error
           handling hooks.  Previously there was no way to automatically/safely
           distinguish a failure of cr_restart from a non-zero exit from the
           restarted application.  This resolves bug 1974.
         + Add --relocate option to enable restart-time replacement of file and
           directory paths saved in the context file.
         + Add --fd option to restore from an open fd, rather than a file.
        See the cr_restart manpage for more details on each of these.
      - This release adds the following feature to the cr_run utility:
         + --omit option to run a process with BLCR support such that the
           process (and its descendants) will be omitted from checkpoints.
      - This release makes the following libcr API additions/changes:
         + cr_forward_checkpoint() is fully tested and thus no longer labeled
           as "use at your own risk".  Its documentation in libcr.h is now
           complete as well.
         + cr_register_hook() is fully tested and thus no longer labeled
           as "use at your own risk".
         + As anticipated in previous releases, the error code returned from
           cr_poll_checkpoint() has CHANGED for the case of restarting from a
           checkpoint of oneself.  This may break existing code that will not
           be prepared for the new errno value.  However, the previous value
           of EINVAL could have masked actual invalid-argument errors.
           The alternative of returning 0 (success) was considered, but was
           discarded because it was deemed valuable to be able to reliably
           distinguish whether one was continuing or restarting from a
           checkpoint of oneself.
         + As alternatives to CR_ETEMPFAIL and CR_EPERMFAIL, authors of BLCR
           callbacks can now specify the errno values to be returned to the
           checkpoint requester on a case-by-case basis.
         + cr_tryenter_cs() has been added as a non-blocking alternative to
           the cr_enter_cs() function.
         + Several functions that would previously respond to usage errors
           with an abort() or undefined behavior now return -1 with one of
           two new values of errno:
            * CR_ENOINIT  - caller has not made the required call to cr_init()
            * CR_ENOTCB   - call is only valid from a checkpoint callback
         + The following functions are marked as deprecated:
            * cr_request()
            * cr_request_file()
            * cr_request_fd()
           They do not provide any mechanism to specify the checkpoint scope
           or to detect errors.  They are currently implemented as (awkward)
           wrappers around cr_request_checkpoint() and cr_poll_checkpoint().
           They are scheduled to be removed in 0.8.0.
        See the comments in include/libcr.h for API documentation.
      - This release introduces two "stub" libraries: libcr_run and libcr_omit.
        They differ from the "full" libcr library in that they contain only a
        BLCR signal handler and the initialization code to register it.  They
        do not include any of the entry points declared in libcr.h, and the
        handler code does not run any callbacks.
        The cr_run utility now uses these libraries in the LD_PRELOAD variable,
        rather than the full libcr.so used in previous releases.
        See the BLCR User's Guide for information on using these libs.
      - This release makes several additions to the BLCR test suite, including
        tests of most of the features new to this release and some motivated by
        bugs fixed in this release.  Many existing tests have been expanded to
        exercise additional corner cases.
      - This release makes the following changes to "configure" behavior:
         + The --with-linux= option now accepts a kernel revision (the output
           of "uname -r") as a value, causing configure to search for that
           revision in some standard locations.  This is intended to make it
           easier and less error prone to specify for which kernel to build.
           In most modern distributions, this single option will be sufficient
           to configure BLCR for any installed kernel.
         + Previously --with-linux= would be used to specify a kernel source
           directory, and if needed --with-linux-obj= could be given to help
           find the corresponding build directory.  With this release the
           role of --with-linux= has changed to be that of a build directory
           and the option --with-linux-src= is available if the sources can't
           be found automatically.
         + The configure-time probes of the kernel headers and configuration
           are now performed using the full CFLAGS/CPPFLAGS from the Linux
           kbuild infrastructure.  This ensures proper configuration with
           Xen-enabled kernels that prepend Xen-specific components to the
           include path.
         + At configure time one can set KCC to specify that the kernel
           modules are to be built with a different C compiler than the
           user-space components of BLCR.
      - On the ARM platform, the "good enough for LinuxThreads" implementation
        of atomic operations has been replaced with truly atomic ops based
        on the kernel-level support added for NPTL.
      - As a temporary work-around for bug 2251, BLCR will currently refuse
        to checkpoint processes with files on hugetlbfs mmap()ed with the
        MAP_PRIVATE flag.  This is to avoid potentially serious instability
        that may result if BLCR attempts to checkpoint such a process.
      - Fixes the following user-visible bugs and "issues":
         + 1974 - Make it possible to decide whether a restart succeeded
         + 2023 - ARM atomics need update
         + 2042 - libcr.so not linked to libc
         + 2124 - Make check fails at pid_in_use.st
         + 2214 - close cri_live_count race
         + 2216 - Move post-restart signal delivery to post-callback
         + 2247 - BLCR assumes 64-bit gcc on 64-bit arch
         + 2248 - Separate CC and KCC
         + 2265 - CR_CHECKPOINT_OMIT generates 0 byte context file
         + 2266 - process doing CR_CHECKPOINT_TEMP_FAILURE is killed!
         + 2271 - cr_checkpoint --clobber fails to overwrite file
         + 2272 - cr_get_restart_info returns wrong src path
         + 2274 - Invalid (zero or huge) pids seen at restart time
         + Remove bash-specific assumptions in the tests
         + Stronger validation of BLCR against proper kernel version
         + Validation of BLCR's kernel module versions against each other
         + Performance improvements from better memory management and
           from coalescing of background work.
         + Preserve error codes for I/O errors.  Previously any error
           from a read/write at kernel level was reported as EIO.  Now
           the original errors (such as ENOSPC) are preserved.
         + Several others found by internal testing or reported by email
           and fixed without assigning bug numbers.
      - The file contrib/blcr.magic contains the format description needed
        by the "file" utility to identify BLCR's context files.
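
    The new --with-linux= behavior described in the NEWS excerpt (accepting
    a kernel revision and searching "some standard locations") could be
    pictured roughly as follows.  This is only a sketch: the NEWS file does
    not list the locations configure actually probes, so the two path
    templates and both function names below are assumptions chosen for
    illustration, not BLCR's real search logic.

```python
import os
import platform

# ASSUMPTION: the announcement only says "some standard locations".
# These two templates are common conventions for kernel build trees,
# not the list that BLCR's configure script really uses.
CANDIDATE_TEMPLATES = (
    "/lib/modules/{rev}/build",
    "/usr/src/linux-{rev}",
)


def candidate_build_dirs(rev):
    """Return the directories a configure-like probe might try for `rev`."""
    return [template.format(rev=rev) for template in CANDIDATE_TEMPLATES]


def resolve_with_linux(value, exists=os.path.isdir):
    """Hypothetical helper: accept a build directory or a kernel revision.

    If `value` already names a directory it is returned unchanged;
    otherwise it is treated as a revision string and the candidate
    locations are tried in order.  Returns None if nothing matches.
    """
    if exists(value):
        return value
    for directory in candidate_build_dirs(value):
        if exists(directory):
            return directory
    return None


if __name__ == "__main__":
    rev = platform.release()  # the same string "uname -r" prints on Linux
    print(rev, "->", resolve_with_linux(rev))
```

    The `exists` parameter is there only so the lookup can be exercised
    without touching the real filesystem.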
    
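    The bug-fix list in the NEWS excerpt notes that kernel-level read/write
    errors are no longer all reported as EIO.  One way a batch wrapper might
    benefit is sketched below.  The function names and the retry policy are
    hypothetical and are not part of BLCR; only the errno constants come
    from the standard library.

```python
import errno
import os


def describe(err):
    """Format an errno value as 'NAME: message' for a log line."""
    return "%s: %s" % (errno.errorcode.get(err, "UNKNOWN"), os.strerror(err))


def retry_on_other_filesystem(err):
    """Hypothetical policy for a batch wrapper around a checkpoint request.

    When every failure was reported as EIO, a full disk looked the same as
    a failing device.  With the original code preserved, ENOSPC can be
    singled out as a case where writing the context file somewhere else
    might succeed.
    """
    return err == errno.ENOSPC


if __name__ == "__main__":
    for err in (errno.ENOSPC, errno.EIO):
        print(describe(err), "| retry elsewhere:", retry_on_other_filesystem(err))
```
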
    -- 
    Paul H. Hargrove                          PHHargrove_at_lbl_dot_gov
    Future Technologies Group
    HPC Research Department                   Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
    
