This guide describes how to use Berkeley Lab Checkpoint/Restart (BLCR) for Linux. For information on installation, configuration, and maintenance of BLCR, please see the companion BLCR Administrator's Guide.
Note: Checkpointing parallel jobs requires a library which has integrated BLCR support. At the present time, many MPI implementations are known to support checkpoint/restart with BLCR. Consult the corresponding BLCR FAQ entry for the current list.
BLCR has been integrated with several batch systems, to differing degrees. Please see the corresponding BLCR FAQ entry for the current list.
The rest of this document assumes that your batch scheduler does not have built-in support for BLCR. In this case you will manually run the BLCR commands needed to checkpoint/restart your jobs.
Note: this does not mean that you cannot checkpoint/restart your applications if you use a batch system without built-in support for BLCR. It simply means that you have to do your checkpoints/restarts manually as described in the remainder of this document. To the batch system, a job that is checkpointed and terminated manually simply looks like a job that has "completed". A restart of an application looks like a "new" job.
This guide assumes that BLCR has already been successfully built, installed, and configured on your system (presumably by you or your system administrator). One easy way to test this is to use the 'lsmod' command to see if the BLCR kernel module is loaded on the node(s) that your program will run on:
% /sbin/lsmod | grep blcr
blcr                   47508  0
blcr_imports            7808  1 blcr
If you don't see these two modules in the output of 'lsmod', then BLCR is not yet running on your system. Consult the BLCR Administrator's Guide for instructions on building and installing BLCR.
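For job scripts, the module check can be wrapped in a small shell function. The following is a minimal sketch: the function name 'blcr_modules_loaded' is hypothetical, and it reads 'lsmod'-style output on standard input so that it can be tried on a machine without BLCR.

```shell
# Hypothetical helper (not part of BLCR): succeed only if both BLCR
# kernel modules appear in lsmod-style output read from standard input.
blcr_modules_loaded() {
    awk '
        $1 == "blcr"         { core = 1 }
        $1 == "blcr_imports" { imports = 1 }
        END { exit !(core && imports) }
    '
}

# Typical use on a compute node:
#   /sbin/lsmod | blcr_modules_loaded || echo "BLCR is not loaded" >&2
```

Matching on the first column avoids a false positive from an unrelated module that merely lists 'blcr' as a dependency.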
Try running
% cr_checkpoint --help
If 'cr_checkpoint' cannot be found, you need to modify your 'PATH' to include the directory where 'cr_checkpoint' lives. You will probably also want to modify your 'LD_LIBRARY_PATH' variable to contain the directory where 'libcr.so' lives, and add the BLCR man directory to your 'MANPATH'.
If your system uses the Environment Modules system to manage software packages, you may be able to get all of your needed environment settings simply by entering something like
% module add blcr
However, there is no requirement that 'blcr' is the name of the module you'll need; your administrator may have given it a different name ('checkpoint', etc.). Or they may not have added BLCR to the set of packages managed by modules, in which case you'll need to use the 'manual' technique below.
Note that this is generally only required if your system administrator has neither installed BLCR in a well-known location nor modified the system-wide defaults for environment variables. However, if your shell startup files override the system-wide defaults, this step may still be necessary.
To manually set up your environment for BLCR, the first thing you need to know is where it has been installed. By default, BLCR installs into the '/usr/local' directory tree, but your system administrator may have put it elsewhere by passing '--prefix=PREFIX' when BLCR was built (where PREFIX can be any arbitrary directory). See your system documents, or try commands such as 'locate cr_checkpoint' or 'find'.
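Once 'locate' or 'find' has turned up the full path to 'cr_checkpoint', the prefix is simply that path with the trailing '/bin/cr_checkpoint' removed. A minimal sketch, assuming the standard 'PREFIX/bin' layout (the function name 'blcr_prefix_of' is hypothetical):

```shell
# Hypothetical helper (not part of BLCR): given the full path to
# cr_checkpoint, print the installation prefix, i.e. the path with
# the trailing /bin/cr_checkpoint removed.
blcr_prefix_of() {
    case $1 in
        */bin/cr_checkpoint) printf '%s\n' "${1%/bin/cr_checkpoint}" ;;
        *) return 1 ;;   # not installed under a PREFIX/bin layout
    esac
}

# Example, if cr_checkpoint is already on your PATH:
#   blcr_prefix_of "$(command -v cr_checkpoint)"
```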
Once you have determined where BLCR is installed, enter the following commands (depending on which type of shell you are using), replacing PREFIX with the value specified for the '--prefix' option used when configuring BLCR.
To configure a Bourne-type shell (such as 'bash' or 'ksh'):
$ PATH=$PATH:PREFIX/bin
$ MANPATH=$MANPATH:PREFIX/man
$ LD_LIBRARY_PATH=$LD_LIBRARY_PATH:PREFIX/lib:PREFIX/lib64
$ export PATH MANPATH LD_LIBRARY_PATH
To configure a csh-type shell (such as 'csh' or 'tcsh'):
% setenv PATH ${PATH}:PREFIX/bin
% setenv MANPATH ${MANPATH}:PREFIX/man
% setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:PREFIX/lib:PREFIX/lib64
These examples assume a "multilib" system with both /lib and /lib64 directories. If your system lacks one of these two directories, the corresponding colon-separated entry may be omitted from the value of LD_LIBRARY_PATH.
The above examples set the PATH, MANPATH and LD_LIBRARY_PATH variables in your current shell only. It is strongly recommended that you make these settings permanent so that they affect future sessions or windows. To do this, add the example commands to your shell's startup files. For a single user of BLCR, add the appropriate set of commands to the shell startup files in your home directory ('.bashrc' for bash, '.profile' for other Bourne-type shells, or '.cshrc' for csh-type shells). For a system-wide installation, add the Bourne shell commands to '/etc/bashrc' and '/etc/profile' and the csh commands to '/etc/cshrc'.
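When these commands live in a startup file, two details are worth guarding against: the same directory being appended again by every nested shell, and a leading colon when the variable starts out empty (an empty entry in PATH or LD_LIBRARY_PATH is treated as the current directory). The following Bourne-shell sketch handles both. The function name 'path_append' and the '/usr/local' value for PREFIX are assumptions for illustration. MANPATH is deliberately left to the plain commands above, since some 'man' implementations interpret an empty entry as "also search the default locations".

```shell
# Hypothetical helper (not part of BLCR): append DIR to the
# colon-separated variable NAME unless it is already listed, and
# avoid a leading colon when NAME is empty or unset.
path_append() {
    eval "_cur=\${$1-}"
    case ":$_cur:" in
        *":$2:"*) ;;                        # already present: nothing to do
        "::")     eval "$1=\$2" ;;          # variable was empty or unset
        *)        eval "$1=\$_cur:\$2" ;;   # normal append
    esac
    export "$1"
}

PREFIX=/usr/local   # assumption: replace with your actual BLCR prefix
path_append PATH "$PREFIX/bin"
path_append LD_LIBRARY_PATH "$PREFIX/lib"
path_append LD_LIBRARY_PATH "$PREFIX/lib64"
```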
% cr_run your_executable [arguments]
'cr_run' loads the BLCR library into your application at startup time. You do not need to modify an application to have it work with 'cr_run'. However, 'cr_run' is limited to dynamically linked executables; statically linked executables will need to use one of the approaches listed below.
% gcc -o hello hello.c -LBLCR_LIBDIR -lcr_run -u cr_run_link_me
Your application will now look for the BLCR library whenever it starts up, but note that this does not mean it will automatically be found: you may also need to set your 'LD_LIBRARY_PATH' environment variable as described earlier (or read the 'ld' man page for information on '-rpath'). If using this approach, please read the "Cautionary linker notes", below.
% gcc -o my_app my_app.c -LBLCR_LIBDIR -lcr
As with the libcr_run example above, you may need to ensure correct settings in your environment to allow the library to be found when the application starts. Note that '-ldl' and '-lpthread' might also be required to satisfy dependencies in libcr.
% env LD_PRELOAD=BLCR_LIBDIR/libcr_run.so.0 your_executable [arguments]
This is essentially how 'cr_run' works.
% gcc -o goodbye goodbye.c -LBLCR_LIBDIR -lcr_omit -u cr_omit_link_me
If your application does not link in BLCR's library via one of the mechanisms listed above, then any attempt to checkpoint it will fail gracefully. In BLCR releases prior to 0.5.0, this situation would cause the program to die unless you handled BLCR's real-time signal explicitly.
BLCR does not currently build static libraries (e.g., libcr.a) by default. However, if your installation was configured to do so, then a link command that includes -static will encounter the same problem detailed above for the --as-needed linker flag, and for the same reason: libraries are not linked unless they satisfy an undefined symbol reference. Since the problem is the same, it should come as no surprise that the solution is too: pass the appropriate "-u symbol" linker flags. If your installation has gone so far as to install only static libraries (not shared), then you will face this problem even without the -static flag.
As a result of the issues above, we strongly recommend that authors of Makefiles that link libcr_run or libcr_omit uniformly use "-lcr_run -u cr_run_link_me" and "-lcr_omit -u cr_omit_link_me". This guards against user-provided LDFLAGS (or equivalent) that could silently cause the desired BLCR library to be omitted from the generated executable.
While one would not normally link libcr without referencing any functions in it, libcr does provide the symbol cr_link_me for completeness.
Finally, note that the symbol you specify after -u is also a good way to check that you've linked in BLCR. Assuming you've not stripped your executable, the following command should find the symbol:
% nm my_app | grep _link_me
If you don't see any output then you've probably not linked in BLCR's library. Recheck your link command, especially the value you used for BLCR_LIBDIR.
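The same check can also report which of the three libraries was linked, based on the symbol names given above (cr_run_link_me, cr_omit_link_me, cr_link_me). This is a minimal sketch; the function name 'blcr_link_kind' is hypothetical, and it reads 'nm' output on standard input.

```shell
# Hypothetical helper (not part of BLCR): read `nm` output on standard
# input and print which BLCR library the link symbol points to.
# Exits non-zero if none of the three symbols is present.
blcr_link_kind() {
    awk '
        $NF == "cr_run_link_me"  { print "libcr_run";  found = 1; exit }
        $NF == "cr_omit_link_me" { print "libcr_omit"; found = 1; exit }
        $NF == "cr_link_me"      { print "libcr";      found = 1; exit }
        END { exit !found }
    '
}

# Example:
#   nm my_app | blcr_link_kind || echo "no BLCR link symbol found" >&2
```

As with the 'grep' version, this only works on an executable that has not been stripped.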
% cr_checkpoint PID
where PID is the application's process ID.
By default, 'cr_checkpoint' saves a checkpoint, and then lets your application continue running, which is useful for saving the state of a process in case it later fails. However, you may terminate the process after it has been checkpointed by passing the '--term' flag:
% cr_checkpoint --term PID
This causes a 'SIGTERM' signal to be sent to the process at the end of the checkpoint. To send a different signal to your process at the end of the checkpoint, you can pass any arbitrary signal number using the '--signal=N' flag or one of the related arguments documented in the manpage for cr_checkpoint.
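In a script, it can be useful to wait until the process has actually exited after 'cr_checkpoint --term' returns, for example before releasing a node. The following is a minimal sketch under stated assumptions: the function name 'checkpoint_and_stop' and the 60-second default timeout are hypothetical choices, and the process is assumed to belong to the calling user so that 'kill -0' can probe it.

```shell
# Hypothetical helper (not part of BLCR): checkpoint PID with --term,
# then poll until the process has exited.
# Usage: checkpoint_and_stop PID [TIMEOUT_SECONDS]
# Returns 1 if cr_checkpoint fails, 2 if the process is still alive
# after the timeout (default 60 seconds, an arbitrary choice).
checkpoint_and_stop() {
    _pid=$1
    _left=${2:-60}
    cr_checkpoint --term "$_pid" || return 1
    while kill -0 "$_pid" 2>/dev/null; do
        [ "$_left" -gt 0 ] || return 2
        sleep 1
        _left=$((_left - 1))
    done
}
```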
By default BLCR interprets the final argument (PID in the examples above) as the process id of a (potentially multi-threaded) process to checkpoint, along with all of its children (and their children, etc.). However, there are options to request a different checkpoint scope:
% cr_checkpoint --pgid PGID
% cr_checkpoint --sid SID
% cr_checkpoint --tree PID
% cr_checkpoint --pid PID
These four examples request, respectively, checkpoints over the scope of a process group, a session, a process tree (the default), and a single process. The PGID is a process group identifier and SID is a session identifier. Here we take the terms "process group" and "session" to mean the set of processes having the given pgid or sid. In most cases the pgid or sid is just the pid of the process group leader or session leader. When in doubt, try using the '-j' option to 'ps' to show PGID and SID columns. The '--tree' flag to 'cr_checkpoint' requests a checkpoint of the process with the given pid, and all its descendants (excluding those whose parent has exited and thus become children of the 'init' process). This is the same as the grouping shown by the output of the 'pstree' command.
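To look up the PGID and SID of a process from a script, one Linux-specific alternative to parsing 'ps -j' output is to read them from '/proc/PID/stat'. A minimal sketch (the function name 'proc_ids' is hypothetical):

```shell
# Hypothetical helper (not part of BLCR): print "PID PGID SID" for a
# process, read from the Linux /proc filesystem.
proc_ids() {
    [ -r "/proc/$1/stat" ] || return 1
    # The file starts with "pid (comm) state ppid pgrp session ...".
    # Strip everything up to the last ") " because comm may contain
    # spaces; state, ppid, pgrp and session are then fields 1 to 4.
    sed 's/^.*) //' "/proc/$1/stat" | awk -v pid="$1" '{ print pid, $3, $4 }'
}

# Example: show the identifiers of the current shell.
#   proc_ids $$
```

The second and third values are what you would pass to '--pgid' and '--sid' respectively.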
When checkpointing multiple processes using one of the scope arguments other than --pid, all the pipes among the processes are saved and restored. Pipes to/from processes not within the checkpoint scope are not saved (these will be replaced at restart time by the stdin or stdout of the 'cr_restart' process).
While 'cr_checkpoint' will accept a process group or session identifier as a scope argument, BLCR does not currently restore the pgid or sid of restarted processes. Instead, restored processes inherit the pgid and sid of the 'cr_restart' process. This is considered a sane default because an unmodified parent (such as a shell) of 'cr_restart' would lose job control over the processes if these identifiers were restored. A future BLCR release will include the ability to request restore of these identifiers.
Files that contain checkpoints are called context files. By default, they are named 'context.ID', where ID is the pid, pgid or sid that was checkpointed, and are stored in the current working directory of the 'cr_checkpoint' process. You may specify an alternate name and location of the context file via the '--file' option.
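If you keep context files in a dedicated directory, it can help to retain the default 'context.ID' naming so the files stay recognizable. A minimal sketch (the function name 'context_path' and the '/scratch/ckpt' directory are hypothetical):

```shell
# Hypothetical helper (not part of BLCR): build a context file path
# under DIR that keeps the default 'context.ID' naming.
# Usage: context_path DIR ID
context_path() {
    printf '%s/context.%s\n' "${1%/}" "$2"
}

# Example (assumes /scratch/ckpt exists and is writable):
#   cr_checkpoint --file "$(context_path /scratch/ckpt 15005)" 15005
```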
There are a number of other options that 'cr_checkpoint' provides. See the man page (or 'cr_checkpoint --help') for details.
To successfully restart from a context file, certain conditions must be met:
You may restart a program on a different machine than the one it was checkpointed on if all of these conditions are met (they often are on cluster systems, especially if you are using a shared network filesystem), and the kernels are the same. The restriction on executables and their shared libraries being the same can be a problem for systems using prelinking; see the BLCR FAQ for information on dealing with systems that prelink.
You can restart a process by using 'cr_restart' on its context file:
% cr_restart context.15005
The original process will be restored, and resume running in the exact state it was in at checkpoint time. Note that this includes restoring its process ID, so you cannot restart a program unless the original copy of it has exited (otherwise 'cr_restart' will fail with a message that the PID is already in use). You may restart a process from a particular context file as many times as you wish. The context file is not automatically removed at any point, so you should delete it if/when it is no longer useful to you.
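Because a restart fails while the original PID is in use, a script can check for that case first and give a clearer message. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the context file is assumed to use the default 'context.ID' name, and ID is assumed to be a pid (with '--pgid' or '--sid' it would be a group or session identifier instead). It checks only that one ID, not the PIDs of any child processes saved in the same checkpoint.

```shell
# Hypothetical helpers (not part of BLCR).

# Print the ID encoded in a default-style context file name.
context_id() {
    _base=${1##*/}
    case $_base in
        context.|context.*[!0-9]*) return 1 ;;   # empty or non-numeric ID
        context.*) printf '%s\n' "${_base#context.}" ;;
        *) return 1 ;;
    esac
}

# On Linux, a live process (or thread) has a /proc/ID directory.
pid_in_use() {
    [ -d "/proc/$1" ]
}

# Refuse to restart while the checkpointed PID is still taken.
safe_restart() {
    _id=$(context_id "$1") || {
        echo "unrecognized context file name: $1" >&2
        return 2
    }
    if pid_in_use "$_id"; then
        echo "PID $_id is still in use; not restarting" >&2
        return 1
    fi
    cr_restart "$1"
}
```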
The best source of information on dealing with any BLCR-aware MPI implementation is the documentation provided with the MPI, or the mailing lists for the MPI.