Recovery
Recovery is a broad topic that applies to several layers of a production database system, as well as processes and procedures practiced in the management of a system. It is important to understand which fault detection and recovery mechanisms exist at all layers of an enterprise application. Application-level robustness, network reliability, service dependencies (for example, naming, directory, routing, and so on) can all fail, and thus can all recover.
When examining the recovery of a database server, consider two primary recovery aspects:
Crash recovery, including the ORACLE instance recovery, that does not require data restoration from backup media
Data recovery, such as data loss incurred due to a crash, system fault, or human error that requires data restoration from backup media
The time required to perform data restoration from backup media is determined by several factors, such as the amount of data to be restored, the sustainable data rate of the backup device, and the sustainable data rate of the primary storage media. Data restoration time is measurable and can be benchmarked.
Two critical elements to any production data center best practice are rigorously testing and documenting recovery procedures. The time required for the manual intervention aspect of recovery, along with the human error factor, make documented recovery procedures a must.
Technology advances, coupled with lessons learned and data center best practices, have reduced the number of data restore scenarios. Redundant storage and better data integrity checking are among the key technologies that minimize the need to restore data following a crash. Our recovery exercise in this article uses the scenario of a database server crash and recovery when data restore is not required.
You can view a standalone database as a layered series of components where a hierarchical dependency existseach layer requires that the layer below has completed recovery and is in a functional state. FIGURE 1 shows a standalone database server's layered components.
FIGURE 1 Layered Components of a Standalone Database
Thus, for the typical standalone database server, the recovery process begins at the hardware layer and continues through the operating system, a host-based redundant array of inexpensive disks (RAID) manager (if present), and the database application.
Hardware Recovery
Sun's mid-range and enterprise server systems implement power-on self test (POST) and automatic system restart (ASR) features that improve overall system availability by allowing the system to boot around a failed component. During the boot process (including reboots following system crashes), the system firmware checks the integrity of the hardware through basic integrity checks. It flags a component that fails integrity checking. In this way, the system continues the boot process instead of stopping and requiring service (for example, replacing a hardware component).
The primary variables of hardware recovery time (assuming human intervention is not required) are the system size (number of processors, amount of physical memory, and number of I/O channels) and the POST execution level. The POST firmware can run more or less extensive diagnostic tests, depending on the level (which varies somewhat across product families). The Sun Fire 3800-6800 systems and the Sun Fire 12K and 15K servers all implement POST diagnostic levels. The 3800-6800 POST level nomenclature is different from the 12K and 15K. The former use names, whereas the 12K and 15K uses numbers.
The 3800-6800 POST levels are listed in TABLE 1, and the Sun Fire 12K and 15K POST diagnostic levels are listed in TABLE 2.
TABLE 1 Sun Fire 3800-6800 POST Levels
POST Level |
Description |
init |
Provides fastest POST run. No testing is done; only system board initialization. |
quick |
Tests all system board components. Provides minimal test counts and patterns. No memory testing is done. |
default/max |
Runs all system board tests and patterns. Performs some memory and Ecache testing. |
mem1 |
Performs all default/maximum testing, plus extensive memory and Ecache testing. |
mem2 |
Includes mem1 testing plus DRAM testing. |
TABLE 2 Sun Fire 12K and 15K POST Levels
POST Level |
Description |
7 |
Probes and initializes the hardware at a minimum level. |
16 |
Provides default level, basic testing of CPUs, memory, and I/O paths. |
24 |
Performs additional memory testing. |
32 |
Provides additional CPU and memory pattern testing. |
64 |
Runs testing at the highest normal dialogistic level. Adds more extensive CPU, memory, and error detection testing. |
96 |
Adds all patterns and locations for memory and Ecache testing. |
127 |
Provides maximum level of testing, including memory and Ecache stress testing. |
We reiterate that the size of the system is a significant factor in POST execution time. For example, a Sun Fire 15K with 16 CPUs and 32 Gigabytes of RAM has a considerably shorter POST execution time than a system with 72 CPUs and 576 Gigabytes of RAM, assuming the same POST diagnostic level. TABLE 3 provides some actual examples of execution time.
TABLE 3 POST Execution Time Examples
Sun Fire Platform |
CPUs/RAM |
POST level |
POST time |
4800 |
8/8 GB |
init |
4 minutes |
4800 |
8/8 GB |
default |
13 minutes |
6800 |
12/48 GB |
default |
23 minutes |
6800 |
12/48 GB |
mem2 |
2 hours, 10 minutes |
15K |
72/576 GB |
16 |
29 minutes |
NOTE
The times are example data points and do not constitute a committed execution time for matching systems and configurations.
Generally, in environments where minimizing boot time and recovery time is critical, it is recommended that system administrators consider setting POST diag-levels to something below the default values, for example: init on Sun Fire 3800-6800 systems and 7 on Sun Fire 12K and 15K systems.
The trade-off is that a lower POST level reduces the ability for testing to find a faulty hardware component, which could cause an outage event. If your system is experiencing hardware problems, consider more extensive POST testing to assist in finding the hardware problem causes. However, when setting POST levels higher than default values, be sure that you fully understand the impact. For large systems, the higher testing levels might result in POST runs that take several hours. Higher POST settings should only be in conjunction with your service organization.
Solaris OS Recovery
The factors driving Solaris OS startup time are the following:
System size (hardware configuration)
Number and type of file systems mounted
Number of attached disk drives, or logical units (LUNs) in the case of intelligent storage controllers
Additionally, if a system crash is the result of a Solaris OS panic, the recovery process includes the time required for Solaris OS to generate a kernel core file.
The first phase is the initial core dump, where the file is written to the primary swap device, or a raw device configured with the dumpadm(1M) command. This action happens following the kernel panic, before the system reboots.
The second phase is the execution of the savecore(1M) command, which executes during boot and copies the kernel core file to a specified directory in the file system (for example, /var/adm/cores). The savecore(1M) command executes automatically during boot to the init level 2, from the /etc/rc2.d/S75savecore startup script.
To minimize boot, reboot, and recovery time, it is recommended that all required file systems have logging enabled. In Solaris 7 OE, the universal file system (UFS) provides a logging feature as a mount(1M) option. A logging file system typically does not require a time-consuming fsck(1M) command following a crash. Depending on the number and size of file systems, logging can significantly reduce restart time.
Also, it is advised for Solaris 8 OE and newer versions that you use the dumpadm(1M) command to configure a separate raw partition to hold a kernel dump file. This configuration has several advantages, including the ability to run savecore(1M) after a system is rebooted, instead of requiring savecore(1M) to complete before system initialization can finish.
RAID Manager Recovery
We include a section on host-based RAID software because of the impact it can have on system recovery time. The two volume RAID software packages on SPARC processor-based Solaris systems are Veritas Volume Manager (VxVM) from Veritas Corporation and Solaris Volume Manager (SVM), which is included with Solaris OS distributions, beginning with Solaris 8 OE.
Both volume managers maintain data areas for the storage of RAID configuration information and device state. During bootstrap, including recovery reboot, the volume manager software must insure the integrity of the configured RAID volumes. The time required to do this scales up linearly with the number of configured devices: volumes in the case of VxVM software and meta devices in the case of SVM software.
From a configuration perspective, it is recommended that a sufficient number of data regions for the volume and state information are configured equally across available disks and controllers. The SVM software uses meta device state databases, and VxVM software implements private regions. When configuring the system, evenly spread the meta device state databases or VxVM private regions across disks and controllers.
ORACLE Recovery
ORACLE recovery is well documented in several places, most notably the ORACLE 9i Database Performance Guide and Reference. ORACLE recovery is organized into two steps:
Cache recovery
Transaction recovery
Cache recovery, also known as rolling forward, is the application of all committed and uncommitted changes from redo log records to the underlying database files (the actual data blocks). Cache recovery is done first, and ORACLE must complete cache recovery before opening the database (completing the startup process).
Transaction recovery, also known as rolling back, is the application of the rollback segments to the underlying data files. This phase handles uncommitted transactions that were in progress when the crash occurred. Any change that was not committed needs to be undone, and the rollback segments facilitate this by keeping an unchanged copy of the data. Rollback begins after cache recovery is completed and the database is opened (the instance is "up").
Because the cache recovery phase is critical to getting the database back online, the focus of recovery tuning is directed at managing and tuning the roll-forward phase, which is about managing and tuning checkpointing. Checkpointing is an online database event that synchronizes the cached copy of disk blocks to the actual data files on disk. It is the flushing of modified block buffers from memory to disk, which in turn reduces redo log entries that would be required for an instance recovery. Checkpointing occurs regularly when an instance is up, and can be managed through both init.ora parameters and configuration decisions.
Managing checkpointing has the net effect of managing the amount of redo log records required for recovery. This is the essence of ORACLE recovery tuning, because the time required for ORACLE to recover is determined by:
Size of the redo log since the last checkpoint (number of redo log records that need to be applied)
Number of data blocks and the ratio of redo log records to data blocks
Input/Output Operations Per Second (IOPS) system capabilities
ORACLE has invested heavily in improving recovery time through extensive optimizations and improvements in the code, as well as providing features that allow administrators ease and flexibility in establishing predictable recovery times. Recovery-related init.ora tunables are summarized in TABLE 4.
TABLE 4 ORACLE init.ora Recovery Tunables
ORACLE Release |
Features |
8.0.X |
log_checkpoint_timeout, log_checkpoint_interval, recovery_parallelism |
8.0.4 |
parallel_execution_message_size, log_checkpoint_timeout, log_checkpoint_interval, recovery_parallelism |
8.1.X |
fast_start_io_target, parallel_execution_message_size, log_checkpoint_timeout, log_checkpoint_interval, recovery_parallelism Includes incremental (soft) checkpointing. |
9.X |
fast_start_mttr_target, fast_start_io_target, parallel_execution_message_size, log_checkpoint_timeout, log_checkpoint_interval, recovery_parallelism |
Prior to the ORACLE 8.1 releases, only hard checkpointing was implemented, meaning that a checkpoint occurred if the log_checkpoint_timeout or log_checkpoint_interval threshold was reached, or a redo log was full and a log switch was required. In ORACLE 8.1, fast start checkpointing was introduced, with the fast_start_io_target init.ora parameter. Using this parameter, administrators can manage incremental (soft) checkpointing. Information on ORACLE fast start checkpointing is available in the Fast-Start: Quick Fault Recovery in ORACLE publication and on the Web at http://otn.oracle.com/deploy/availability/htdocs/fs_chkpt.html.
ORACLE further improved recovery features in ORACLE 9i with the addition of the fast_start_mttr_target init.ora parameter. In the next section, we examine ORACLE parameters.