Berkeley Lab Checkpoint/Restart (BLCR) User's Guide

This guide describes how to use Berkeley Lab Checkpoint/Restart (BLCR) for Linux. For information on installation, configuration, and maintenance of BLCR, please see the companion BLCR Administrator's Guide.

1. About Berkeley Lab Checkpoint/Restart

Checkpoint/Restart allows you to save one or more processes to a file and later restart them from that file. There are three main uses for this:

Scheduling: Checkpointing a program allows a program to be safely stopped at any point in its execution, so that some other program can run in its place. The original program can then be run again later.
Process Migration: If a compute node appears to be likely to crash, or there is some other reason for shutting it down (routine maintenance, hardware upgrade, etc.), checkpoint/restart allows any processes running on it to be moved to a different node (or saved until the original node is available again).
Failure recovery: A long-running program can be checkpointed periodically, so that if it crashes due to hardware, system software, or some other non-deterministic cause, it can be restarted from a point in its execution more recent than the beginning.

Berkeley Lab Checkpoint/Restart (BLCR) provides checkpoint/restart on Linux systems. BLCR can be used either with processes on a single computer, or on parallel jobs (such as MPI applications) which may be running across multiple machines on a cluster of Linux nodes.

Note: Checkpointing parallel jobs requires a library which has integrated BLCR support. At the present time, many MPI implementations are known to support checkpoint/restart with BLCR. Consult the corresponding BLCR FAQ entry for the current list.

2. Checkpoint/restarting within a BLCR-aware batch control system

One way to use BLCR is with a batch scheduler system (a.k.a. "job controller", "queue manager", etc.) that knows how to use the BLCR tools to checkpoint and restart the jobs under its control. You can simply tell such a system to "suspend" or "checkpoint" a job, and then later to "resume" or "restart" it.

BLCR has been integrated with several batch systems, to differing degrees. Please see the corresponding BLCR FAQ entry for the current list.

The rest of this document assumes that your batch scheduler does not have built-in support for BLCR. In this case you will manually run the BLCR commands needed to checkpoint/restart your jobs.

Note: this does not mean that you cannot checkpoint/restart your applications if you use a batch system without built-in support for BLCR. It simply means that you have to do your checkpoints/restarts manually as described in the remainder of this document. To the batch system, a job that is checkpointed and terminated manually simply looks like a job that has "completed". A restart of an application looks like a "new" job.

3. Checkpointing Jobs with the BLCR command-line tools

3.1 Make sure BLCR is installed and loaded

This guide assumes that BLCR has already been successfully built, installed, and configured on your system (presumably by you or your system administrator). One easy way to test this is to use the 'lsmod' command to see if the BLCR kernel module is loaded on the node(s) that your program will run on:

    % /sbin/lsmod | grep blcr
    blcr                   47508   0
    blcr_imports            7808   1 blcr

If you don't see these two modules in the output of 'lsmod', then BLCR is not yet running on your system. Consult the BLCR Administrator's Guide for instructions on building and installing BLCR.
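
As a rough sketch of how this check could be scripted (for example at the top of a job script), the following shell function reads 'lsmod'-style output on standard input and succeeds only if both modules are listed. The function name 'blcr_modules_loaded' is hypothetical and is not part of BLCR.

```shell
# Hypothetical helper (not part of BLCR): succeed only if both BLCR kernel
# modules appear in lsmod-style output read from standard input.
blcr_modules_loaded() {
    awk '
        $1 == "blcr"         { core = 1 }
        $1 == "blcr_imports" { imports = 1 }
        END { exit !(core && imports) }
    '
}

# Example use on a node where /sbin/lsmod exists:
#   /sbin/lsmod | blcr_modules_loaded || echo "BLCR modules are not loaded" >&2
```

Reading from standard input rather than calling 'lsmod' directly keeps the function usable on systems where 'lsmod' lives somewhere other than '/sbin'.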

3.2 Make sure your environment is set up correctly

You must ensure that the BLCR commands, libraries and manual pages can be found in your shell.

Try running

    % cr_checkpoint --help

If 'cr_checkpoint' cannot be found, you need to modify your 'PATH' to include the directory where 'cr_checkpoint' lives. You will probably also want to modify your 'LD_LIBRARY_PATH' variable to contain the directory where 'libcr.so' lives, and add the BLCR man directory to your 'MANPATH'.

Setting up your environment with 'modules'

If your system uses the Environment Modules system to manage software packages, you may be able to get all of your needed environment settings simply by entering something like

    % module add blcr

However, there is no requirement that 'blcr' is the name of the module you'll need; your administrator may have given it a different name ('checkpoint', etc.). Or s/he may have neglected to add BLCR to the set of packages managed by modules, in which case you'll need to use the 'manual' technique below.

Manually setting up your environment

Note that this is generally only required if your system administrator has neither installed BLCR in a well-known location nor modified the system-wide defaults for environment variables. However, if your shell startup files override the system-wide defaults, this step may still be necessary.

To manually set up your environment for BLCR, the first thing you need to know is where it has been installed. By default, BLCR installs into the '/usr/local' directory tree, but your system administrator may have put it elsewhere by passing '--prefix=PREFIX' when BLCR was built (where PREFIX can be any arbitrary directory). See your system documents, or try commands such as 'locate cr_checkpoint' or 'find'.

Once you have determined where BLCR is installed, enter the following commands (depending on which type of shell you are using), replacing PREFIX with the value specified for the '--prefix' option used when configuring BLCR.

To configure a bourne-type shell (such as 'bash' or 'ksh'):

    $ PATH=$PATH:PREFIX/bin
    $ MANPATH=$MANPATH:PREFIX/man
    $ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:PREFIX/lib:PREFIX/lib64
    $ export PATH MANPATH LD_LIBRARY_PATH

To configure a csh-type shell (such as 'csh' or 'tcsh'):

    % setenv PATH ${PATH}:PREFIX/bin
    % setenv MANPATH ${MANPATH}:PREFIX/man
    % setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:PREFIX/lib:PREFIX/lib64

These examples assume a "multilib" system with both /lib and /lib64 directories. If your system lacks one of these two directories, the corresponding colon-separated entry may be omitted from the value of LD_LIBRARY_PATH.

The above examples set the PATH, MANPATH and LD_LIBRARY_PATH variables in your current shell only. It is strongly recommended that you make these settings permanent, so that they also affect future sessions and windows. To do this, add the example commands to your shell's startup files. For a single user of BLCR, add the appropriate set of commands to the shell startup files in your home directory ('.bashrc' for bash, '.profile' for other bourne-type shells, or '.cshrc' for csh-type shells). For a system-wide installation, add the bourne shell commands to '/etc/bashrc' and '/etc/profile' and the csh commands to '/etc/cshrc'.
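
When these commands live in a startup file that may be read more than once, the variables can accumulate duplicate entries. One possible way to avoid that in a bourne-type shell is sketched below. The helper name 'append_path' and the 'BLCR_PREFIX' variable are illustrative assumptions, and '/usr/local' is simply the default install prefix mentioned earlier.

```shell
# Hypothetical helper: append DIR to the colon-separated variable named VAR,
# unless DIR is already one of its entries.
append_path() {
    # $1 = variable name (e.g. PATH), $2 = directory to append
    eval "_ap_cur=\${$1:-}"
    case ":$_ap_cur:" in
        *":$2:"*) ;;                                   # already present
        *) eval "$1=\${_ap_cur:+\$_ap_cur:}\$2" ;;     # append, no leading colon
    esac
}

# BLCR_PREFIX is an assumed name; set it to your actual install prefix.
BLCR_PREFIX=${BLCR_PREFIX:-/usr/local}
append_path PATH            "$BLCR_PREFIX/bin"
append_path LD_LIBRARY_PATH "$BLCR_PREFIX/lib"
append_path LD_LIBRARY_PATH "$BLCR_PREFIX/lib64"
export PATH LD_LIBRARY_PATH

# MANPATH is deliberately left out of this sketch: some man implementations
# give an empty MANPATH entry a special meaning, so set it as shown earlier.
```

Variables assigned inside the function are global in a plain bourne shell, which is why the temporary variable carries an '_ap_' prefix.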

4. Checkpointing/restarting applications on a single machine

4.1 Types of applications supported

BLCR currently supports:

Single threaded applications
Multithreaded applications using the NPTL implementation of pthreads (NOTE: BLCR 0.7.0 dropped support for the older LinuxThreads implementation)
Process trees, meaning a process and all its "reachable" descendants (excluding those whose parent has exited)
Process groups (as defined by POSIX), which typically means a command pipeline launched by a shell (e.g. "cat foo bar | sort")
POSIX sessions, which typically means a login shell and all its descendants or a batch job

However, certain applications are not supported because they use resources not restored by BLCR:

Applications which use sockets (regardless of address family). If a checkpoint is taken of a process with open sockets, they will not be restored when the process is restarted. Applications or libraries may register a checkpoint callback to manage socket connections and re-open them at restart time (this is how MPI libraries typically work with BLCR), but the core BLCR checkpointer does not directly support restoring sockets. BLCR releases prior to 0.5.0 would fail at checkpoint time if sockets were open.
Applications which use character or block devices (e.g. serial ports or raw partitions). At restart time any devices will appear to have been closed. As with sockets, code that is BLCR-aware may choose to take its own measures to deal with devices.
Applications which use SystemV IPC mechanisms including shared memory, semaphores and message queues. As with sockets, code that is BLCR-aware may choose to take its own measures to deal with these resources.
Others - this list is not exhaustive. If you have questions about specific resources, see the section "For more information" for contact information.

4.2 Making an application checkpointable

To be checkpointed successfully with BLCR, an application must contain some library code that BLCR provides. There are several ways of ensuring this. Note: in the examples that follow we will use "BLCR_LIBDIR" to denote the directory where the appropriate BLCR libraries are installed. If PREFIX is the root of your BLCR install, then this will typically be either "PREFIX/lib" or "PREFIX/lib64", depending on your system.

Start your executable with the 'cr_run' command:
```
        % cr_run your_executable [arguments]
```
'cr_run' loads the BLCR library into your application at startup time. You do not need to modify an application to have it work with 'cr_run'. However, 'cr_run' is limited to dynamically linked executables; statically linked executables will need to use one of the approaches listed below.
Link your application with BLCR's 'libcr_run'. For instance, you could make a simple 'hello world' C program checkpointable via
```
        % gcc -o hello hello.c -LBLCR_LIBDIR -lcr_run -u cr_run_link_me
```
Your application will now look for the BLCR library whenever it starts up, but note that this does not mean it will automatically be found: you may also need to set your 'LD_LIBRARY_PATH' environment variable as described earlier (or read the 'ld' man page for information on '-rpath'). If using this approach, please read the "Cautionary linker notes", below.
Link your application with BLCR's 'libcr'. While linking libcr_run (see above) works for simple 'hello world' applications, it doesn't allow the application any sort of control over what gets checkpointed. To get such control, link to 'libcr' with a command such as
```
        % gcc -o my_app my_app.c -LBLCR_LIBDIR -lcr
```
As with the libcr_run example above, you may need to ensure correct settings in your environment to allow the library to be found when the application starts. Note that '-ldl' and '-lpthread' might also be required to satisfy dependencies in libcr.
Link your application with some library which uses BLCR. For instance, if your MPI library has been made BLCR-aware, it will cause libcr to be loaded, and so simply linking with mpicc is enough to make your application checkpointable.
Use run-time loading to dynamically link 'libcr' or 'libcr_run' (see the 'dlopen' man page). This mechanism can be used for building applications or libraries that must work both with and without BLCR present on a system.
Force the 'libcr_run.so' dynamic library to be loaded at startup by adding its full pathname to the 'LD_PRELOAD' environment variable (or just the filename if the directory is listed in 'LD_LIBRARY_PATH'). We do not recommend setting this in your environment in general (via 'export' or 'setenv'), since certain programs may interact poorly with the BLCR library logic. Instead, we recommend that you use a command like
```
        % env LD_PRELOAD=BLCR_LIBDIR/libcr_run.so.0 your_executable [arguments]
```
This is essentially how 'cr_run' works.
Finally there exists the option to link an application in such a way that it will simply disappear from a checkpoint. While this case may sound odd, there are actual instances of "helper" processes that can be omitted from a multi-process checkpoint and re-fork()ed when restarting. For this purpose, one can link libcr_omit. The link command is identical to that for libcr_run, changing the two instances of "run" to "omit":
```
        % gcc -o goodbye goodbye.c -LBLCR_LIBDIR -lcr_omit -u cr_omit_link_me
```

If your application does not link in BLCR's library via one of the mechanisms listed above, then any attempt to checkpoint it will fail gracefully. In BLCR releases prior to 0.5.0 this situation would cause the program to die unless you handled BLCR's real-time signal explicitly.
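
For scripts that must run on nodes both with and without BLCR, the first approach listed above ('cr_run') can be wrapped so that the job still starts when 'cr_run' is missing. This is only a sketch, and the function name 'run_checkpointable' is hypothetical.

```shell
# Hypothetical wrapper: start a command under cr_run when it is available,
# otherwise warn and start the command directly (it will not be checkpointable).
run_checkpointable() {
    if command -v cr_run >/dev/null 2>&1; then
        cr_run "$@"
    else
        echo "warning: cr_run not found; starting '$1' without BLCR" >&2
        "$@"
    fi
}

# Example: run_checkpointable your_executable arg1 arg2
```

The wrapper passes the command's exit status through unchanged in both branches, so callers can test it as usual.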

Cautionary linker notes

The ELF linker for Linux will normally include ELF DT_NEEDED tags for all dynamic libraries given on the link command line. However, one can pass the --as-needed linker option to generate such tags only for those libraries that supply some symbol referenced elsewhere. This causes a problem, because the intended use of libcr_run and libcr_omit is linking with executables that have not been modified to reference any BLCR symbols. It is for this reason that the instructions above contain "-u cr_run_link_me" and "-u cr_omit_link_me". These each serve to artificially create an undefined reference to a symbol which we've supplied in the BLCR libraries for exactly this purpose, thus ensuring the appropriate BLCR library is linked.

BLCR does not currently build static libraries (e.g. libcr.a) by default. However, if your installation was configured to do so, then a link command that includes -static will encounter the same problem detailed above for the --as-needed linker flag (and for the same reason: libraries are not linked unless they satisfy an undefined symbol reference.) Since the problem is the same, it should come as no surprise that the solution is too: pass the appropriate "-u symbol" linker flags. If your installation has gone so far as to install only static libraries (not shared), then you will face this problem even without the -static flag.

As a result of the issues above, we strongly recommend that authors of Makefiles that link libcr_run or libcr_omit uniformly use "-lcr_run -u cr_run_link_me" and "-lcr_omit -u cr_omit_link_me". This will guard against the possibility of user-provided LDFLAGS (or equivalent) that could silently trigger the omission of the desired BLCR library from the generated executable.

While one would not normally link libcr without referencing any functions in it, libcr does provide the symbol cr_link_me for completeness.

Finally, note that the symbol you specify after -u is also a good way to check that you've linked in BLCR. Assuming you've not stripped your executable, the following command should find the symbol:

        % nm my_app | grep _link_me

If you don't see any output then you've probably not linked in BLCR's library. Recheck your link command, especially the value you used for BLCR_LIBDIR.
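
The same check can be made slightly more informative by reporting which of the three marker symbols named above was found. The sketch below reads 'nm' output on standard input. The function name 'blcr_linked_library' is hypothetical, and the code assumes the symbol name is the last field on each 'nm' output line.

```shell
# Hypothetical helper: read `nm` output on stdin and print the BLCR library
# implied by each *_link_me marker symbol found; fail if none is present.
blcr_linked_library() {
    awk '
        $NF == "cr_run_link_me"  { print "libcr_run";  found = 1 }
        $NF == "cr_omit_link_me" { print "libcr_omit"; found = 1 }
        $NF == "cr_link_me"      { print "libcr";      found = 1 }
        END { exit !found }
    '
}

# Example: nm my_app | blcr_linked_library || echo "no BLCR library linked" >&2
```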

4.3 Checkpointing the process

To checkpoint a process, simply run

    % cr_checkpoint PID

where PID is the application's process ID.

By default, 'cr_checkpoint' saves a checkpoint, and then lets your application continue running, which is useful for saving the state of a process in case it later fails. However, you may terminate the process after it has been checkpointed by passing the '--term' flag:

    % cr_checkpoint --term PID

This causes a 'SIGTERM' signal to be sent to the process at the end of the checkpoint. To send a different signal to your process at the end of the checkpoint, you can pass any arbitrary signal number using the '--signal=N' flag or one of the related arguments documented in the manpage for cr_checkpoint.

By default BLCR interprets the final argument (PID in the examples above) as the process id of a (potentially multi-threaded) process to checkpoint, along with all of its children (and their children, etc.). However, there are options to request a checkpoint with a different scope:

    % cr_checkpoint --pgid PGID
    % cr_checkpoint --sid SID
    % cr_checkpoint --tree PID
    % cr_checkpoint --pid PID

These four examples request, respectively, checkpoints over the scope of a process group, a session, a process tree (the default), and a single process. The PGID is a process group identifier and SID is a session identifier. Here we take the terms "process group" and "session" to mean the set of processes having the given pgid or sid. In most cases the pgid or sid is just the pid of the process group leader or session leader. When in doubt, try using the '-j' option to 'ps' to show PGID and SID columns. The '--tree' flag to 'cr_checkpoint' requests a checkpoint of the process with the given pid, and all its descendants (excluding those whose parent has exited and which have thus become children of the 'init' process). This is the same as the grouping shown by the output of the 'pstree' command.
In BLCR releases prior to 0.6.0 the --pid option was the default.

When checkpointing multiple processes using one of the scope arguments other than --pid, all the pipes among the processes are saved and restored. Pipes to/from processes not within the checkpoint scope are not saved (these will be replaced at restart time by the stdin or stdout of the 'cr_restart' process).

While 'cr_checkpoint' will accept a process group or session identifier as a scope argument, BLCR does not currently restore the pgid or sid of restarted processes. Instead, restored processes inherit the pgid and sid of the 'cr_restart' process. This is considered a sane default because an unmodified parent (such as a shell) of 'cr_restart' would lose job control over the processes if these identifiers were restored. A future BLCR release will include the ability to request restore of these identifiers.

Files that contain checkpoints are called context files. By default, they are named 'context.ID', where ID is the pid, pgid or sid that was checkpointed, and are stored in the current working directory of the 'cr_checkpoint' process. You may specify an alternate name and location of the context file via the '--file' option.

There are a number of other options that 'cr_checkpoint' provides. See the man page (or 'cr_checkpoint --help') for details.
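
As an illustration of the periodic checkpointing mentioned under "Failure recovery" in section 1, the sketch below takes a checkpoint at a fixed interval for as long as the process is alive, writing each one to its own numbered context file. The function name, the numbered file-naming scheme and the '--file=NAME' spelling of the option are assumptions; check the 'cr_checkpoint' man page on your system for the exact option syntax.

```shell
# Hypothetical helper: checkpoint PID every INTERVAL seconds while it is alive.
# Usage: checkpoint_periodically PID INTERVAL [DIR]
checkpoint_periodically() {
    _cp_pid=$1 _cp_interval=$2 _cp_dir=${3:-.}
    _cp_n=0
    # kill -0 sends no signal; it only tests whether the process can be signalled.
    while kill -0 "$_cp_pid" 2>/dev/null; do
        _cp_n=$((_cp_n + 1))
        # One numbered context file per checkpoint, so older ones are kept.
        cr_checkpoint --file="$_cp_dir/context.$_cp_pid.$_cp_n" "$_cp_pid" || return 1
        sleep "$_cp_interval"
    done
}

# Example: checkpoint_periodically 15005 600 /scratch/ckpt
```

Since context files are never removed automatically, a loop like this will keep consuming disk space until older files are deleted by hand or by a separate cleanup step.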

4.4 Restarting the process

To successfully restart from a context file, certain conditions must be met:

The PIDs of processes in the context file must not be in use, OR you may specify --no-restore-pid to obtain new pids (but see the cr_restart man page for limitations of this approach).
The original executable must be available: either it must exist with its contents unchanged, OR you may specify any of the options --save-exe, --save-private or --save-all to cr_checkpoint to include the executable in the context file.
All shared libraries used by the executable must be available: either they must exist with their contents unchanged, OR you may specify either of the options --save-private or --save-all to cr_checkpoint to include the shared libraries in the context file.
Because BLCR saves and restores most open files "by reference" (storing pathnames rather than file contents), the following should be true of files that were open()ed or mmap()ed when the checkpoint was taken, though certain applications may be more tolerant than these rules imply:
- Files must exist at their original paths, OR you may use the --relocate option to cr_restart to specify alternative location(s) for files. It is permissible to use a copy of the original file (for instance when performing migration).
- Files and directories must have permissions that permit them to be opened for the original access modes.
- Files open for reading should not be modified relative to their contents when the checkpoint was taken in any way observable by the application.
- Files that are open for writing which are written in an append-only manner (regardless of the O_APPEND flag) may have data appended after the checkpoint was taken. Such data will be truncated away when the application is restarted.
- Files that are open for both reading and writing should be restored to exactly their state at the time of the checkpoint.
An exception to the open()ed/mmap()ed file rules above exists for files already unlinked (deleted) at the time the checkpoint is taken. When BLCR encounters such a file, its entire contents are saved in the checkpoint context file and later restored as a new (also unlinked) file.
You may specify --save-all to cr_checkpoint to include all mmap()ed files in the context file and thus eliminate the requirements above with respect to mmap()ed files only -- this option does not do anything for files that have been open()ed.

Of these requirements, BLCR is only able to verify the availability of the PIDs and the existence and permissions of the executable, libraries and open files. Failure to satisfy those constraints will lead to an explicit failure from BLCR. Violation of the rules against modification to any files will not be detected by BLCR and the resulting effects on the restarted application are unpredictable.

You may restart a program on a different machine than the one it was checkpointed on if all of these conditions are met (they often are on cluster systems, especially if you are using a shared network filesystem), and the kernels are the same. The restriction on executables and their shared libraries being the same can be a problem for systems using prelinking; see the BLCR FAQ for information on dealing with systems that prelink.

You can restart a process by using 'cr_restart' on its context file:

    % cr_restart context.15005

The original process will be restored, and resume running in the exact state it was in at checkpoint time. Note that this includes restoring its process ID, so you cannot restart a program unless the original copy of it has exited (otherwise 'cr_restart' will fail with a message that the PID is already in use). You may restart a process from a particular context file as many times as you wish. The context file is not automatically removed at any point, so you should delete it if/when it is no longer useful to you.
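
Because a restart fails when the original PID is still in use, a script may want to check this before calling 'cr_restart'. The sketch below is one way to do so. The function name 'restart_preflight' is hypothetical, it assumes the context file still has its default 'context.ID' name with ID being a pid, and for a multi-process checkpoint it checks only that one pid.

```shell
# Hypothetical helper: basic sanity checks before running cr_restart.
# Returns 0 if the restart looks possible, non-zero otherwise.
restart_preflight() {
    _rp_ctx=$1
    [ -r "$_rp_ctx" ] || { echo "cannot read context file: $_rp_ctx" >&2; return 1; }
    _rp_pid=${_rp_ctx##*.}   # text after the last dot; assumes 'context.PID'
    case $_rp_pid in
        ''|*[!0-9]*) echo "cannot infer a pid from: $_rp_ctx" >&2; return 2 ;;
    esac
    # kill -0 only sees processes we may signal; /proc also covers other users'.
    if kill -0 "$_rp_pid" 2>/dev/null || [ -e "/proc/$_rp_pid" ]; then
        echo "pid $_rp_pid is in use; cr_restart would fail" >&2
        return 3
    fi
}

# Example: restart_preflight context.15005 && cr_restart context.15005
```

This check is only advisory: 'cr_restart' performs its own verification, and a pid that is free now may be taken by the time the restart runs.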

5. Checkpointing/restarting an MPI application

The best source of information on dealing with any BLCR-aware MPI implementation is the documentation provided with the MPI, or the mailing lists for the MPI.

6. For more information

For more information on Checkpoint/Restart for Linux, visit the project home page: http://ftg.lbl.gov/checkpoint, and/or check out our answers to Frequently Asked Questions about BLCR. When those resources don't answer your questions, you may e-mail [email protected] for help.