User-Level Checkpointing Migration
Checkpointing is among the features that HPC sites most often request from vendors. Many HPC codes run for hours or even days, so when a run is interrupted, it should be possible to save its state and resume later rather than start over from the beginning. There are three types of checkpointing:
Kernel level
User level, using checkpointing libraries
Application level
Kernel-level checkpointing is provided by the native operating system. The Solaris OE does not presently provide this capability. Application-level checkpointing is provided from within the application by adding specific code that allows the application to checkpoint itself. This article focuses on the remaining type, user-level checkpointing, which relies on publicly available checkpointing libraries.
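For contrast with the library-based approach, the following is a minimal sketch of application-level checkpointing, in which the program itself saves and restores its own state. It does not represent the Condor library or the Sun GE software. The function names (save_checkpoint, load_checkpoint, run), the JSON file format, and the stop_after parameter used to simulate an interruption are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to disk atomically (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a crash never leaves a partial checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..total_steps-1, resuming from a checkpoint if present.

    stop_after simulates an interruption after that many steps in this
    invocation; in that case the function returns None.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated interruption; work since last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

A run that is interrupted and then restarted with the same checkpoint path resumes from the last saved step instead of from step 0. The cost of this approach is that every application must carry such code, which is the burden that user-level checkpointing libraries aim to remove.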
Sun Grid Engine Software
The Sun Grid Engine (Sun GE) software, previously known as CODINE, is distributed resource management software that allows sites to efficiently manage and use the compute resources of machines across their network.
The Sun GE software is available free for download from the Sun Grid Engine website:
The Sun GE source code has also recently been released to the open source community at the following site:
http://gridengine.sunsource.net
The Sun GE product is Sun's distributed resource management tool for cluster grids. It is responsible for managing and submitting jobs to available compute resources in an individual grid. The Sun GE software maximizes CPU utilization, increasing productivity and return on investment. An enterprise edition of the Sun GE software, Sun Grid Engine, Enterprise Edition (Sun GEEE), is Sun's resource management software solution targeted at enterprise grids. This new version orchestrates and delivers computational power according to enterprise policies that are set by the organization's technical and management staff. The Sun GEEE software uses these policies to examine the available computational resources within the enterprise grid, then gathers, allocates, and delivers these resources automatically so that highly optimized resource usage is achieved across the enterprise grid. The Sun GEEE software assigns a controlled share of the total computing resources to groups, users, and departments.
Condor Checkpointing Library
The Condor project at the University of Wisconsin is a fully distributed resource management software project that includes a user-level checkpointing library. This library can be used either as an integral part of the Condor system or as a standalone component of other distributed resource management software, such as the Sun GE software product. This article describes the standalone library as it is used with the checkpointing facility in the Sun GE software.