User-Level Checkpointing Deployment
The submission of a checkpoint job to a Sun GE environment is similar to the submission of a regular job, with the addition of the following options to the qsub(1) command:
-ckpt checkpoint_env_name
-c [m|s|n|x]
FIGURE 6 shows how a checkpointing job is submitted.
FIGURE 6 Submitting a Checkpointing Application to the Sun GE Environment
In the Sun lab experiment, the -c x option was used because the job was to be checkpointed only when it was suspended. The Sun GE software supports other checkpointing occasions; consult the qsub(1) man page to determine which behavior suits your application.
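The two options can be combined as in the following minimal sketch, which composes the submission command without running it. The function name build_ckpt_qsub and the job script job.sh are hypothetical; condor_ckpt is the checkpointing environment name used in this article's own example, and the accepted -c values are only the four listed above.

```shell
#!/bin/sh
# Sketch: compose (but do not run) a qsub command line for a checkpointing job.
# build_ckpt_qsub and job.sh are hypothetical names used for illustration.

build_ckpt_qsub() {
    ckpt_env=$1    # name of the checkpointing environment (-ckpt)
    occasion=$2    # checkpointing occasion (-c): one of m, s, n, x
    script=$3      # job script to submit

    case "$occasion" in
        m|s|n|x) ;;   # the specifiers listed for the -c option
        *)
            echo "unsupported -c specifier: $occasion" >&2
            return 1
            ;;
    esac

    echo "qsub -ckpt $ckpt_env -c $occasion $script"
}

# Print the command for a job that is checkpointed only when it is suspended.
build_ckpt_qsub condor_ckpt x job.sh
```

Printing the command first makes it easy to review the options before the job is actually submitted.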
Migration of Checkpointing Jobs
The Sun GE software provides several ways to initiate job migration. FIGURE 7 shows the framework of the migration feature. In the Sun lab experiment, two triggers for job migration were tested: suspending the job and suspending the queue.
FIGURE 7 Migrating a Checkpointing Application
You can use the following procedure to apply job migration for a checkpointing application.
To Migrate a Job Using the Sun GE Software
Type the following qsub(1) command:
qsub -ckpt condor_ckpt -c x ...
Use the qmon graphical window to monitor the job execution on a particular queue.
Open the qmon Job Control window, and suspend the job.
FIGURE 8 Job Control Window
The job then shows up on the queue of a second execution host.
Suspend the job on the second host.
The job should migrate to the queue of the first execution host and complete successfully. The migration feature was also tested by suspending the queue instead of the job, and the job again migrated and completed successfully.
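The procedure above triggers migration from the qmon window. As a command-line sketch of the same trigger, the following assumes that the qmod(1) command and its -s (suspend) option are available in your Sun GE installation; this was not part of the lab experiment described here. The function name suspend_target, the DRY_RUN variable, and the job ID 42 are hypothetical.

```shell
#!/bin/sh
# Sketch: suspend a job or a queue from the command line to trigger migration.
# Assumes qmod(1) with the -s option; by default the command is only printed.

suspend_target() {
    # $1: job ID or queue name to suspend
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "qmod -s $1"      # dry run: show the command only
    else
        qmod -s "$1"           # suspend for real (requires a Sun GE setup)
    fi
}

# Show the command that would suspend the hypothetical job 42.
suspend_target 42
```

Set DRY_RUN=0 to run the command rather than print it.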
Condor User-Level Checkpointing Limitations
The Condor user-level checkpointing libraries place some limitations on the jobs that they can transparently checkpoint and migrate. The following list contains some of these limitations:
Multiprocess jobs are not supported. This includes system calls such as fork(), exec(), and system(). Consequently, MPI programs are not supported.
Interprocess communication is not supported. This includes pipes, semaphores, and shared memory.
Network communication must be short. A job may make network connections using system calls, such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. These signals are reserved by the Condor system. Sending or receiving all other signals is allowed.
Alarms, timers, and sleep calls are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().
Multiple kernel-level threads are not supported. However, multiple user-level threads are supported.
Memory mapped files are not supported. This includes system calls such as mmap() and munmap().
File locks are allowed, but they are not retained between checkpoints.
All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning, but not an error.
A fair amount of disk space must be available on the submitting machine for storing checkpoint images. A checkpoint image is approximately equal to the virtual memory consumed by a job while it runs. If disk space is short, a special checkpoint server can be designated for storing all of the checkpoint images in a pool.
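Before relinking an application with the checkpointing libraries, it can help to scan its source for calls that the list above rules out. The following is a rough heuristic sketch, not a Condor tool: the function name find_ckpt_blockers is hypothetical, the call names are the ones mentioned in the list (with the POSIX spelling getitimer()), and exec() is matched only under that literal name. Because it is a plain-text match, comments and string literals can produce false positives.

```shell
#!/bin/sh
# Heuristic sketch: list source lines that call functions the limitations
# above say cannot be checkpointed transparently.

find_ckpt_blockers() {
    # $1: path to a C source file
    # Match a listed name that is not part of a longer identifier and is
    # followed by an opening parenthesis; -n prefixes each hit with its
    # line number. Exit status is 0 if anything matched, 1 otherwise.
    grep -n -E '(^|[^A-Za-z0-9_])(fork|exec|system|alarm|getitimer|sleep|mmap|munmap)[[:space:]]*\(' "$1"
}
```

For example, find_ckpt_blockers app.c (with app.c standing in for your own source file) prints each suspicious line with its line number. An empty result does not prove that a job is safe to checkpoint, because limitations such as interprocess communication or read-write file access are not covered by this check.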