BLCR Frequently Asked Questions (for version 0.8.5)

General Questions

Build/Install Questions

Usage Questions

Additional Resources


General Questions

What is BLCR?

BLCR (Berkeley Lab Checkpoint/Restart) allows programs running on Linux to be "checkpointed" (written entirely to a file), and then later "restarted". BLCR can be found at http://ftg.lbl.gov/checkpoint.

How is checkpoint/restart different than SIGSTOP/SIGCONT?

Putting a process to sleep (via the SIGSTOP signal) only stops its execution. Taking a checkpoint writes a snapshot of a process to disk: the process may either be allowed to continue running after the checkpoint is complete, or you can kill the process to release all of its resources.

With sleep, a process's resources are not all fully released (such as virtual memory, network connections, process id, etc.). Checkpointing then killing a process fully releases all system resources.

Restarts from checkpoint files can be used across machine reboots, and/or even on different machines than the one that the checkpoint was taken on. This is not true for SIGCONT.
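The point about a stopped process keeping its resources can be observed with ordinary signals, without BLCR installed. The following is a minimal sketch that uses a throwaway `sleep` process; the variable names are only illustrative.

```shell
# Start a throwaway background process to experiment on.
sleep 30 &
pid=$!

# SIGSTOP suspends execution, but the kernel keeps the process and its PID.
kill -STOP "$pid"
if kill -0 "$pid" 2>/dev/null; then
    stopped_state="pid still allocated"
else
    stopped_state="pid gone"
fi
echo "while stopped: $stopped_state"

# SIGCONT resumes it; only termination actually releases the PID.
kill -CONT "$pid"
kill -TERM "$pid"
wait "$pid" 2>/dev/null || true
if kill -0 "$pid" 2>/dev/null; then
    final_state="pid still allocated"
else
    final_state="pid released"
fi
echo "after termination: $final_state"
```

A checkpoint followed by a kill corresponds to the second half of this sketch: the PID and other resources are released, yet the process can still be brought back later from its context file.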

How is BLCR different than "user-level" checkpointing libraries like Condor, etc.?

BLCR performs checkpointing and restarting inside the Linux kernel. While this makes it less portable than solutions that use user-level libraries, it also means that it has full access to all kernel resources, and can thus restore resources (like process IDs) that user-level libraries cannot. This also allows BLCR to checkpoint/restart groups of processes (such as shell scripts and their subprocesses), together with the pipes that connect them.

What kinds of Linux systems does BLCR support?

BLCR runs on x86 and x86_64 (Opteron/EM64T) systems running Linux 2.6.x and 3.x.y kernels. With the 0.8.5 release, we believe the following to work:

BLCR 0.7.0 added experimental support for PPC (32-bit), and 0.6.0 added experimental support for PPC64 and ARM. These three architectures have been tested as follows:

We are interested to hear of your success or failure with these three experimental architectures, especially on kernels older than those we have tested.

Note that 0.6.x was the last release series to support 2.4.x kernels.

What Linux distributions does BLCR work with?

BLCR uses a set of configure-based tests to determine which kernel features are available, and so in principle, BLCR should work with any distribution that uses a supported CPU/kernel combination (see above).

Historically, BLCR has been tested with kernels from numerous versions of SuSE, RHEL (and clones such as CentOS and Scientific Linux), Fedora, Debian, Ubuntu, and many vanilla Linux kernels (from kernel.org) from 2.6.0 on up. We have not tested every single version of the kernel from every vendor, nor is each BLCR release retested against all distributions tested in the past. However, we believe that BLCR should work on most distributions using kernels in the ranges given above (except where vendors may have applied patches that bring in problematic changes from kernels outside that range).

If, after reading this question and the one above, you believe your platform should be supported but cannot get BLCR to work, then please consult our bug database for possible solutions and report the problem if you don't find it already reported. We can't try every possible platform ourselves and count on users' bug reports to let us know when our testing has missed something.

Can BLCR checkpoint/restart multithreaded programs?

Yes, BLCR can checkpoint both single- and multithreaded (pthreads) programs linked with the NPTL implementation of pthreads. Starting with the 0.7.0 release, BLCR no longer provides full support for the older LinuxThreads implementation of pthreads. Please contact us if you are able to devote some effort to restoring/maintaining such support in the future.

BLCR has not been tested with other threading packages, such as those used by some Java runtimes. We are interested in hearing of both success and failure with other threading packages.

Can BLCR checkpoint/restart multi-process applications?

Yes, starting with version 0.5.0 BLCR is able to save and restore groups of related processes together with the pipes that connect them. To do this, BLCR must be given a single request that covers all the processes involved. Currently there are three ways to specify a group request to BLCR:

While BLCR can save and restore the pipes used for IPC among processes in these groups, it is unable at this time to deal with most other IPC mechanisms (see next FAQ).

Are there limits to the types of programs BLCR can checkpoint?

Yes. BLCR does not support checkpointing certain process resources. While the following list is not exhaustive, it lists the most significant issues we are aware of.

If needed, applications can arrange to save any information necessary to recreate/reacquire such resources at restart time (see next FAQ).

What if I want to checkpoint a program that uses resources that BLCR can't checkpoint?

BLCR supports adding 'callbacks' to user-level code, which are called when a checkpoint is about to be performed, and when it is restarted (or continues on after the checkpoint). This is how MPI communication can be handled (see next FAQ).

Full documentation of the callback interface has not yet been written because some of the interfaces are still subject to change. However, the comments in the file libcr.h should provide enough to get started.

Does BLCR support checkpointing parallel/distributed applications?

Not by itself. But by using checkpoint callbacks (see previous FAQ), many MPI implementations have made themselves checkpointable by BLCR. You can checkpoint/restart an MPI application running across an entire cluster of machines with BLCR, without any application code modifications, if you use one of these MPI implementations (listed alphabetically): See the documentation of your specific MPI for usage instructions. In almost all cases you will need to use a tool provided by the MPI implementation to request a checkpoint or restart, rather than using BLCR's cr_checkpoint and cr_restart utilities.

At this time we are not aware of other MPI implementations that are working on BLCR support, but our information is not always the latest. If in doubt, check the support channels of your favorite MPI implementation.

Note that questions about using these MPI implementations with BLCR are more likely to receive a useful response if directed to the support channels of the specific MPI implementation than to the [email protected] list.

Do I need root access in order to use BLCR?

Root access is needed to install the BLCR kernel modules. However, once these are installed, any user can checkpoint and restart their own programs without needing root permission.

Is BLCR integrated with any batch systems?

We are aware of the following, but we are not always informed of new efforts to integrate with BLCR. For the most up-to-date information you should consult the support channels of your favorite batch system.
TORQUE version 2.3 and later
Support for serial and parallel jobs, including periodic checkpoints and qhold/qrls.
SLURM version 2.0 and later
Support for automatic (periodic) and manually requested checkpoints.
SGE (aka Sun Grid Engine)
Information on configuring SGE to use BLCR can be found here. There is also a thread on the [email protected] list about modifications to those instructions. The thread begins with this posting.
LSF
Information on configuring LSF to use BLCR can be found in this posting on the [email protected] list.
Condor
Information on configuring Condor to use BLCR to checkpoint "Vanilla Universe" jobs with the help of Parrot can be found here.
Work is ongoing by third parties to integrate BLCR into other batch systems. If you are interested in adding BLCR support to a job launcher/scheduler, please contact us!

Note that questions about using these batch systems with BLCR are more likely to receive a useful response if directed to the support channels of the specific batch system than to the [email protected] list.

How hard is it to port BLCR to an architecture that isn't currently supported?

Most of the architecture-specific code in BLCR is confined to a small set of logic to save and restore the CPU-specific register set (vmadump) and some gcc inline assembly for atomic operations and special system calls. The majority of BLCR's code base is entirely processor-independent.

If you are interested in seeing BLCR run on other chips, and are able to devote programmer resources, please contact us! The Alpha platform is likely to be the easiest since vmadump already supports this architecture for Linux 2.6 kernels.


Build/Install Questions

Does BLCR require a kernel patch?

No. All of the kernel logic used by BLCR is implemented within kernel modules. You can thus compile BLCR and load it into a running kernel (with modprobe or insmod) without needing to recompile your kernel or reboot.
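After loading the modules, it can be handy to confirm that the kernel actually lists them. The sketch below checks a /proc/modules-style listing for a module named "blcr", the name used elsewhere in this FAQ. The helper name blcr_module_loaded is hypothetical and is not part of BLCR.

```shell
# Report whether a module named "blcr" appears in a /proc/modules-style listing.
# Each line of that listing starts with the module name followed by a space.
# The optional argument lets you point at a saved copy of the listing.
blcr_module_loaded() {
    modules_file=${1:-/proc/modules}
    [ -r "$modules_file" ] && grep -q '^blcr ' "$modules_file"
}

if blcr_module_loaded; then
    echo "blcr module is loaded"
else
    echo "blcr module is not loaded"
fi
```

The pattern is anchored and ends in a space so that a line for blcr_imports alone does not count as a match.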

What do I need in order to build and use BLCR?

A machine that is running a supported architecture (x86 and x86_64 are fully supported and PPC/PPC64 and ARM are "Experimental") and Linux kernel 2.6.x or 3.x.y.

A set of configured kernel headers that matches the kernel you wish to build against. By configured, we mean that include/linux/version.h and the files in include/linux/modules/ match the target kernel. For many distributions a kernel-devel or linux-headers package is often enough if using the vendor's kernel. For a custom kernel, the actual kernel build directory is often required.

The kernel's symbol table. Normally the file /boot/System.map, or equivalent will serve this purpose.
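To see whether a symbol table is present before running configure, something like the following sketch can help. It tries a few locations mentioned in this FAQ. The exact file name inside /usr/lib/debug/boot is an assumption, and first_readable is a hypothetical helper, not a BLCR tool.

```shell
# Print the first readable file among the given candidates; fail if none is.
first_readable() {
    for candidate in "$@"; do
        if [ -r "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

release=$(uname -r)
# The name used under /usr/lib/debug/boot is assumed to mirror the /boot name.
if map=$(first_readable "/boot/System.map-$release" /boot/System.map \
        "/usr/lib/debug/boot/System.map-$release"); then
    echo "candidate symbol table: $map"
else
    echo "no System.map found in the usual places" >&2
fi
```

Finding a file this way does not prove that it matches the target kernel; it only gives a path worth passing to --with-system-map.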

What if my kernel sources are unconfigured?

BLCR needs to be able to examine a linux kernel source tree that has been configured, and this configuration must match the kernel that you will run BLCR against.

If you do not have a configured linux kernel source tree, you may be able to create one fairly easily. Many distributions provide a 'config' file that is all you need to produce a configured kernel source tree. Good places to look for a config file include /boot/config-2.6.5-1.358 or /config-2.6.5-1.358. In some distributions, the kernel is set up to include its configuration in /proc/config.gz (or /proc/config.bz2). If you can find any one of these files, proceed with the following steps:

  1. Make a copy of the unconfigured source for the linux kernel you are using, and copy in the file you located:
      $ cp -a /usr/src/kernel-source-2.6.5 /tmp/linux-2.6.5-1.358
      $ cd /tmp/linux-2.6.5-1.358
      $ cp [CONFIG_FILE] .config
    
  2. Configure it using one of the following:
    • For kernels 2.6.6 and newer:
        $ make modules_prepare
      
    • For 2.6.x kernels prior to 2.6.6:
        $ make prepare-all scripts
      
  3. Once that is done, you should be able to configure BLCR using the newly configured kernel source. You should continue to use the System.map file from your running kernel. What you want is probably something like
      $ ./configure --with-system-map=/boot/System.map-2.6.5-1.358 --with-linux=/tmp/linux-2.6.5-1.358
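The choice in step 2 depends only on the kernel version, so it can be scripted. The following sketch encodes the 2.6.6 cutoff described above; kernel_prepare_target is a hypothetical helper name, and the function is only meaningful for 2.6.x and newer kernels.

```shell
# Choose the kernel "make" target that prepares a source tree for building
# external modules. Only meaningful for 2.6.x and newer kernels.
kernel_prepare_target() {
    # Drop any vendor suffix such as "-1.358", then split on dots.
    version=${1%%-*}
    IFS=. read -r major minor patch rest <<EOF
$version
EOF
    patch=${patch:-0}
    if [ "$major" -eq 2 ] && [ "$minor" -eq 6 ] && [ "$patch" -lt 6 ]; then
        echo "prepare-all scripts"
    else
        echo "modules_prepare"
    fi
}

# Show the command that would apply to the running kernel.
echo "make $(kernel_prepare_target "$(uname -r)")"
```

For the 2.6.5-1.358 kernel used in the steps above, this selects "prepare-all scripts".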
    

How do I get past the error "linux/modversions.h: No such file or directory"?

Please try rebuilding blcr after commenting out the following six lines near the top of the files vmadump/vmadump.c and blcr_imports/imports.c:

   #if defined(CONFIG_MODVERSIONS) && ! defined(MODVERSIONS)
     #define MODVERSIONS
   #endif
   #if defined(MODVERSIONS)
     #include <linux/modversions.h>
   #endif
Let us know if your compilation still doesn't work.

What if I get an error about my System.map file?

To build, BLCR needs to read the System.map file that corresponds to the kernel you will use BLCR with. Generally, BLCR will find this file "automagically" during ./configure, but some distributions do not provide it, and/or you may not keep yours in a standard place.

If you know where the correct System.map file is, use

  $ ./configure --with-system-map=PATH_TO_YOUR_SYSTEM.MAP

If your System.map is absent, it may still be available as an optional RPM. For instance you may be able to get it by installing (depending on the release) either the kernel-source or kernel-devel RPMs for the kernel you will use BLCR with.

However, Fedora Core 2 and some of its derivatives are shipping a "stripped-down" System.map file. If this is the case, BLCR will abort during configuration with an error stating that the System.map cannot be used. You must install an additional RPM which contains a full System.map in order to build BLCR. In Fedora Core 2 the 'kernel-debuginfo' RPM contains a full System.map file, which it will install into the /usr/lib/debug/boot directory. BLCR's configure script will search this directory, but just to be certain you may still wish to pass '--with-system-map' to point configure at the correct System.map file.

Important Note: If you need to install the kernel-debuginfo RPM, make sure the correct version is installed. Specifically, the 'arch' type must be the same. If your kernel was built for the 'i386' (or 'i586', or 'i686'), the kernel-debuginfo RPM must have the same value. Thus, for an i586 kernel, install 'kernel-debuginfo-2.6.5-1.358.i586.rpm'. To determine which kernel version you have, use
  $ rpm -q kernel --qf '%{version}-%{release}.%{arch}\n'
To make sure that you have installed compatible kernel and kernel-debuginfo RPMs, use
  $ rpm -q kernel kernel-debuginfo --qf '%{version}-%{release}.%{arch}\n'
(replace 'kernel' with 'kernel-smp' if you are using an SMP kernel). You should see the same string, repeated twice.
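The "same string, repeated twice" check can be automated. In this sketch, all_lines_identical is a hypothetical helper that reads the rpm query output on standard input; the rpm command itself is the one shown above.

```shell
# Succeed only if standard input holds at least one line and all lines are identical.
all_lines_identical() {
    distinct=$(sort -u | wc -l | tr -d ' ')
    [ "$distinct" -eq 1 ]
}

# Only attempt the query where rpm is available.
if command -v rpm >/dev/null 2>&1; then
    if rpm -q kernel kernel-debuginfo --qf '%{version}-%{release}.%{arch}\n' \
            | all_lines_identical; then
        echo "kernel and kernel-debuginfo match"
    else
        echo "kernel and kernel-debuginfo differ (or one is not installed)"
    fi
fi
```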

If you try to use BLCR with the wrong System.map, BLCR will build without complaints, but will probably detect the problem when the blcr.o kernel module is loaded (it does this by comparing some well-known exported kernel symbols' addresses to those provided by the System.map file), and the module load will be aborted.

Why do I get "Invalid module format" when loading BLCR's kernel modules?

Kernels from 2.6.0 through 2.6.18 check at module load time that the same compiler version (major and minor numbers) was used to build both the module and the kernel. This is the error you will see if they don't match. When this happens, you will need to reconfigure and build BLCR with the correct compiler. When a module fails to load due to a version mismatch, you should be able to find a message in the system logs indicating the required compiler version:
  blcr_imports: version magic '2.6.17 SMP mod_unload PENTIUM4 gcc-3.4' should be '2.6.17 SMP mod_unload PENTIUM4 gcc-3.2'
Alternatively, the following should find the "signature" in existing kernel modules:
  $ find /lib/modules/`uname -r` -name '*.ko' -print | head -1 | xargs strings | grep gcc
  vermagic=2.6.17 SMP mod_unload PENTIUM4 gcc-3.2
In this case a gcc-3.2.X version is required.

Regardless of which method is used to find the correct version, you will need to reconfigure BLCR to use the correct compiler. To do so, rerun configure with the addition of "KCC=/path/to/the/correct/gcc" to the command line to set the compiler used for building BLCR's kernel modules.
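Pulling the required compiler version out of either message can be scripted. The sketch below works on the two sample strings shown above; both helper names are hypothetical. It takes the last "gcc-" token on the line, which in the log message is the version the kernel expects.

```shell
# Extract the gcc major.minor version from a vermagic or "version magic" line.
# The greedy match picks the last "gcc-" token, i.e. the required version.
required_gcc_from_vermagic() {
    printf '%s\n' "$1" | sed -n 's/.*gcc-\([0-9][0-9.]*\).*/\1/p'
}

# Succeed if a full compiler version (second argument) belongs to the
# required major.minor series (first argument).
compiler_matches() {
    case "$2" in
        "$1"|"$1".*) return 0 ;;
        *) return 1 ;;
    esac
}

required=$(required_gcc_from_vermagic \
    "vermagic=2.6.17 SMP mod_unload PENTIUM4 gcc-3.2")
echo "kernel modules were built with gcc $required"
```

A candidate compiler can then be tested with compiler_matches "$required" "$(/path/to/the/correct/gcc -dumpversion)" before it is passed to configure as KCC.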


Usage Questions

Why do I get this error: "Restart failed: Device or resource busy"?

This is because a resource needed in order to restart the process is already in use. The most common problem is that another process already exists with the same pid (process ID): the operating system will not allow two processes to share a pid. Very frequently this is because a user is trying to 'restart' a process from a checkpoint while the original process they took the checkpoint of is still running!

If you are unlucky enough that some other, unrelated process has grabbed the PID of your application, you must figure out some way to get rid of that process. If you own the process, you can of course simply kill it (or checkpoint it!). Otherwise, consider becoming root, or consulting your system administrator. BLCR will not kill another process for you (this 'feature' would raise certain security issues).
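Before restarting, you can check whether the PID you need is already taken. The following sketch assumes a Linux system with /proc mounted; pid_in_use is a hypothetical helper, not a BLCR command.

```shell
# Succeed if some process currently holds the given PID.
# kill -0 fails for processes owned by other users, so fall back to /proc.
pid_in_use() {
    kill -0 "$1" 2>/dev/null || [ -d "/proc/$1" ]
}

# Example: the current shell's own PID is, by definition, in use.
pid=$$
if pid_in_use "$pid"; then
    echo "PID $pid is in use; a restart that needs this PID would fail"
else
    echo "PID $pid is free"
fi
```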

Why can I restart jobs on the original machine they ran on, but not a different node in my cluster?

You should be able to restart a BLCR-checkpointed job on a different node (or set of nodes, for a parallel job), provided that all the nodes involved provide the exact same libraries and other files that your executable needs. By default BLCR does not save the contents of shared libraries that your program uses, nor does it save the contents of any files your program has open()'ed. But so long as all of these libraries and/or files exist on another node, your program should restart fine.

Note that libraries must be exactly the same for a restart to work; if they are not the same size, for instance, restart will not work. If you've installed the same version of the operating system to all of your nodes (and you've updated them all the same way), you would think things would be fine. However, some Linux distributions are using "prelinking", which is a method for assigning fixed addresses for shared libraries to load into executables. Prelinking is a feature which enables applications that use many shared libraries to load faster. The fixed address used by the same library on different nodes is often deliberately randomized (in order to defeat buffer overflow attacks that could otherwise rely on standard libraries being loaded at the same address on every machine with the same OS version). Alas, if the prelinked addresses are different, you will not be able to restart BLCR checkpoints on another node.

The solution for this problem is to disable prelinking on both the source and destination nodes of any process migration before starting any process you may wish to migrate. For most cluster environments, that means disabling it on all nodes before using BLCR for migration. Prelinking is a systemwide setting, so you will need to be root. On Fedora Core 2, at least, the fix is to edit /etc/sysconfig/prelink and set 'PRELINKING=no'. The comments claim that this will cause prelinking to be undone automatically the next night. We've never been patient, and instead "undo" prelinking immediately by running (as root)

  # /usr/sbin/prelink --undo --all
Automating this process for an entire cluster depends on your specific environment.
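As a starting point for such automation, the sketch below checks whether the configuration file mentioned above has prelinking switched off. The helper name prelinking_disabled is hypothetical, and the file location is the Fedora one named in this FAQ; other distributions may differ.

```shell
# Succeed if the given prelink configuration file sets PRELINKING=no.
# Defaults to the Fedora location mentioned in this FAQ.
prelinking_disabled() {
    conf=${1:-/etc/sysconfig/prelink}
    [ -r "$conf" ] && grep -Eq '^[[:space:]]*PRELINKING="?no"?[[:space:]]*$' "$conf"
}

if prelinking_disabled; then
    echo "prelinking is disabled in the configuration"
else
    echo "prelinking is not confirmed disabled on this node"
fi
```

A setting of "no" only stops future prelinking; libraries that are already prelinked stay that way until the nightly job runs or prelink --undo --all is run by hand.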

BLCR 0.7.0 introduced the --save-* family of options to cr_checkpoint to cause the executable and/or shared libraries to be included in the context file. This may significantly increase the size of the context files. Therefore we recommend this approach only if you cannot ensure uniform library versions (w/o prelinking) across the machines you wish to migrate among.

Some versions of glibc use mmap() to load locale data. CentOS 5 and Scientific Linux 5, for example, store the locale information in the file /usr/lib/locale/locale-archive. This file is generated when glibc (specifically, the glibc-common RPM) is installed, and contains archived versions of all of the localization databases. Since the locale-archive file is not included in the RPM, but is generated at install time, the contents of this file can differ between your compute nodes. This causes problems migrating programs that use glibc's locale support, bash being the most notable. If you see problems migrating bash scripts between nodes, you might have this problem.

You can fix this problem in one of three ways. First, you can use the --save-all option to cr_checkpoint to ensure that the correct localization data is loaded at restart time. Second, you can disable localization support completely by using export LANG=C in your shell environment. Finally, we have had success copying a single version of /usr/lib/locale/locale-archive to all compute nodes. You'll need to update this file every time you update your glibc-common RPM.
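Whether a library or the locale archive really is identical on two nodes can be checked by comparing a checksum and size rather than trusting package versions. This is a minimal sketch using the standard cksum utility; file_fingerprint is a hypothetical helper name.

```shell
# Print "<crc>:<size>" for a file, so that results from different nodes can be
# compared as plain strings even if the file lives at a different path.
file_fingerprint() {
    cksum < "$1" | awk '{print $1 ":" $2}'
}

archive=/usr/lib/locale/locale-archive
if [ -r "$archive" ]; then
    echo "$archive $(file_fingerprint "$archive")"
fi
```

Running this on each node and comparing the output shows whether the files match. Differing fingerprints for a file that your program maps are a likely reason a migrated restart fails.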

Why do I get the error "Restart failed: No such file or directory"?

This error normally means that a file that was open at the time the checkpoint was taken is no longer present (it is either completely gone, or perhaps just not present at the same pathname as it was previously). You should examine your system logs (such as /var/log/messages or dmesg) for an indication of the file that caused the problem. You will probably find a message like one of the following:
    vmadump: mmap failed: /tmp/hsperfdata_[user]/[pid]
    Failed to open file '/tmp/foobar'
In the case of files in a directory of the form /tmp/hsperfdata_[user] see the hsperfdata FAQ entry. For other files, there are a couple of things you might try.

If the file is a temporary file created by your application, it is possible that it was removed when the application terminated. For instance, if you checkpoint an application with the --term option to the cr_checkpoint utility, then SIGTERM is sent to the application, causing it to clean up before terminating. If this is the case, then passing --kill instead will cause the uncatchable signal SIGKILL to be sent, thus preventing any cleanup by the application. Of course, if your application ran to completion and removed its temporary file at its normal conclusion, then you are on your own as to how to recover the file.

If you are trying to perform migration of a process from one machine to another, then it is possible that the file exists, but not at the full pathname that was saved by BLCR. This is especially true if network filesystem mountpoints differ between machines. You may be able to work around such issues with symbolic links among directories, but BLCR provides a --relocate option to cr_restart that can easily deal with such situations.

Future versions of BLCR will make it possible to capture the contents of all open files, thus providing a mechanism to eliminate problems of this sort (at the cost of increased context file size, of course).

Why do I get the error "Restart failed: permission denied"?

There are multiple reasons why a restart would fail with this message, but the most common is filesystem permissions. You should examine your system logs (such as /var/log/messages or dmesg) for an indication of the file that caused the problem. You will probably find a message like one of the following:
    vmadump: mmap failed: /var/db/nscd/hosts
    Failed to open file '/tmp/foobar'
You should verify that your user has permission to access the file. In the case of files in the directory /var/db/nscd see the NSCD FAQ entry.
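The pathname can be pulled out of either kind of log message and then tested directly. The sketch below handles the two message forms shown above; both helper names are hypothetical, and real log lines may carry extra prefixes such as timestamps, which the patterns tolerate.

```shell
# Extract the pathname from a "mmap failed" or "Failed to open file" log line.
failed_path_from_log() {
    printf '%s\n' "$1" | sed -n \
        -e 's/^.*mmap failed: \(.*\)$/\1/p' \
        -e "s/^.*Failed to open file '\(.*\)'.*/\1/p"
}

# Describe why the current user might be unable to reopen a path at restart.
describe_path_problem() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ ! -r "$1" ]; then
        echo "not readable"
    else
        echo "accessible"
    fi
}

path=$(failed_path_from_log "Failed to open file '/tmp/foobar'")
echo "$path: $(describe_path_problem "$path")"
```

A result of "missing" corresponds to the "No such file or directory" case, while "not readable" points at the permissions problem described here.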

Why do I get the error "ERROR: ld.so: object 'libcr_run.so.0' from LD_PRELOAD cannot be preloaded: ignored"?

This error is almost always a symptom that BLCR's shared libraries are not located for one of two reasons: Please see, respectively, the sections "Updating ld.so.cache" and "Configuring Users' environments" in the BLCR Admin Guide for information on resolving these installation issues.

Why do I get "vmadump: mmap failed: /var/db/nscd/[something]" in the system logs?

The files in the directory /var/db/nscd/ are created by the Name Service Cache Daemon (NSCD, for short) and are protected against normal file accesses. When the NSCD is enabled, certain C library operations talk to the daemon which uses filedescriptor passing to allow the application to mmap() these files. Unfortunately, BLCR cannot replay the filedescriptor passing (there is no way to know that a given filedescriptor was obtained in this way). This leaves BLCR to rely on normal filesystem permissions when trying to reestablish the mmap(), which in this case finds insufficient permissions.

C library functions that may use NSCD include host database lookups (such as gethostbyname()), user database lookups (such as getpwuid()) and other database lookup functions. Since use of NSCD may speed these lookups relative to network-based lookups such as NIS, LDAP or DNS, this is generally a good thing.

If you are experiencing the "mmap failed: /var/db/nscd/..." error, you have at least the following three options:

Our thanks to Guy Coates <gmpc AT sanger DOT ac DOT uk> for information on the /etc/nscd.conf settings.
Our thanks to Hongjia Cao <hjcao AT nudt DOT edu DOT cn> for information leading to the implementation of LIBCR_DISABLE_NSCD.

Why do I get "vmadump: mmap failed: /tmp/hsperfdata_[user]/[pid]" in the system logs?

Directories of the form /tmp/hsperfdata_[user]/ are created by Sun's Java runtime environment (JRE), and the individual files in this directory are removed when the application exits. This includes exits due to catchable fatal signals, so if a checkpointed Java application is terminated by such a signal these files will be unavailable for a subsequent restart. You have several options to deal with this issue:

Our thanks to Guy Coates <gmpc AT sanger DOT ac DOT uk> for information on the -XX:-UsePerfData JRE option.

Why does my application die with "Real-time signal 31" (or 32, etc.) when I try to checkpoint it?

This error was possible in older releases of BLCR when an application was not checkpointable. This should not happen in release 0.5.0 or newer and should be reported as a bug if seen.

Why can't I checkpoint my statically linked application?

If you can checkpoint and restart a dynamically linked application correctly, but cannot do so with the same application linked statically, this FAQ entry is for you. There are multiple reasons why BLCR may have problems with statically linked executables.

Can I use Linux kernel auditing support with BLCR?

We recommend that you avoid auditing when using BLCR. In particular, we've seen odd restart failures (on return to user space) when migrating processes between nodes where auditing is enabled (with /sbin/auditctl -e1) to nodes where auditing is disabled. The migrating processes themselves were not being audited, neither at checkpoint time nor during restart. In some cases, but not all, these restart failures produce warning messages about leaked audit contexts.

More importantly, BLCR does not generate audit records during checkpoint or restart. As far as the auditing code is concerned, all of the BLCR kernel calls are described by ioctl() records. You don't see audit records describing the creation of the context file, the mmap() established during restart, or the file descriptors that are restored. Please contact the mailing list if you need auditing support during checkpoint/restart calls.

Prior to version 0.8.4, the linked_fifo test caused a kernel BUG when audited during restart. If you encounter any other problems, please report a bug to the mailing list.


Additional Resources

Where can I download BLCR?
Where can I find more information?

To download the BLCR software, or for links to all the available information about BLCR, please visit http://ftg.lbl.gov/checkpoint.

Is there a mailing list for BLCR?

There is a mailing list of BLCR developers and some of the users at [email protected] and which is archived:

This list is managed by Mailman. So, to subscribe (or unsubscribe) visit https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/checkpoint.

Where can I report BLCR bugs?

If you think you've found a bug in BLCR, please do let us know about it. There are many kernel dependencies in BLCR and we could easily have missed testing on a system like yours. We count on users' bug reports to help ensure wide testing coverage.

The BLCR bug database is managed by Bugzilla, located at http://mantis.lbl.gov/bugzilla.

Before reporting a bug, you are encouraged to search the database to see if a bug report exists for your problem. For some issues a solution can be found in just a day. So, a patch to fix your problem may already be attached to an existing bug report. BLCR is just one of a group of projects managed on this server, so be sure to select product "BLCR" in your queries.