- Background
- Availability
- Recovery
- ORACLE Cache Recovery Tuning
- Recovery and Performance Measurements
- Best Practices
- About the Author
- Acknowledgments
- Related Resources
- Ordering Sun Documents
- Accessing Sun Documentation Online
Recovery and Performance Measurements
To better understand the recovery capabilities on SPARC processor-based Solaris systems running ORACLE, we designed and conducted several lab experiments that involved building an ORACLE database running an online transaction processing (OLTP) workload. We then injected a fatal fault that caused the system to crash. From this experiment, we measured recovery time. We ran two types of tests:
Fault injection tests. We introduced a fatal fault while the system was under load. We measured the recovery times when the system and database were restarted.
Performance tests. We completed OLTP runs to assess the performance impact of recovery tuning.
We used the following two systems in the experiment:
System A: Sun Enterprise Ultra 4000 with 8 CPUs, 4 gigabytes of RAM, and 112 disks; running Solaris 8 OE and ORACLE 8.1.7. The Solaris Volume Manager software was used to configure RAID devices.
System B: Sun Fire 6800 with 24 CPUs, 196 gigabytes of RAM, and 540 disks; running Solaris 9 OE and ORACLE 9.2. The Veritas Volume Manager software was used to configure RAID devices.
To cause a fatal fault, we executed a simple script as root to force a kernel panic. The system A results are summarized in TABLE 5.
TABLE 5 System A Test Results
Throughput |
Chkpnt |
% perf |
Recovery Time |
Recovery Speedup |
7965.90 |
None |
|
1624 |
|
7681.70 |
1800 |
3.57 |
1824 |
0.89 |
6154.70 |
900 |
22.74 |
1011 |
1.61 |
6054.87 |
600 |
23.99 |
981 |
1.66 |
5974.20 |
300 |
25.00 |
340 |
4.78 |
5690.70 |
150 |
28.56 |
249 |
6.52 |
The table heading definitions are as follows:
Throughput is the OLTP workload performance, in transactions per minute.
Chkpnt is the checkpoint interval in seconds.
%perf is the performance regression from the base run (no checkpointing).
Recovery Time is the amount of time, in seconds, ORACLE recovery required.
Recovery Speedup is how much faster the result was versus the no-checkpoint case.
NOTE
Note that this set of runs was done using hard checkpointing only. Checkpoints were issued at desired intervals through a script.
Our data shows an extreme case; extreme because we used hard checkpointing, and the system was I/O constrained. However, it does give a clear indication that recovery time can be substantially improved (by a factor of 6), albeit in this case with a substantial trade-off in performance (28% regression). The primary lesson from this run is that it's about IOPS. Tuning an I/O-constrained system to meet aggressive recovery goals potentially degrades performance when hard checkpointing is used.
We then moved the test to a much larger system (System B), where a larger database was built, and we used ORACLE 9.2.
The primary testing involved measuring improvements using recovery_parallelism and increasing the message buffer size via parallel_execution_message_size. TABLE 6 shows the results.
TABLE 6 System B Test Results
No RP/2k |
24/2k |
24/4k |
24/8k |
24/16k |
24/32k |
24/64k |
43m59s |
33m04s |
26m16s |
23m18s |
22m40s |
22m25s |
22m31s |
The column headings indicate the parallelism level and the message size. The times are the duration of the ORACLE recovery phase, from the beginning of crash recovery through the database open completion. The first column has no recovery parallelism and a 2 kilobytes message size. Subsequent runs used 24 recovery processes, and only the message size was varied.
Our data shows significant improvements in recovery through parallelism, a 25% reduction in recovery time. Further improvements were seen with message size increases. Increasing to an 8 kilobytes message size (with parallelism) reduced recovery time by half. Note that these tests were done without turning on the fast start mechanism, thus there was not an application performance tradeoff in making these changes.
NOTE
Message size adjustments affect parallel query slaves. A workload running parallel queries was not tested, so performance impact is not known.
In our case, the optimal message size for tuning was 8 kilobytes. Beyond that, improvements in recovery time were negligible.
We ran another set of test after tuning fast_start_mttr_target, to measure the impact on performance. TABLE 7 summarizes the results.
TABLE 7 System B Alternate Tuning Test Results
MQTh |
fast_start_mttr |
% of base |
125927.13 |
0 |
|
115580.67 |
300 |
91.78 |
116199.47 |
600 |
92.28 |
117947.8 |
900 |
93.66 |
117574.8 |
1200 |
93.37 |
Our data shows a very reasonable performance tradeoff for aggressive recovery settingsjust about 8% performance regression in the worst case.
On the recovery side, ORACLE 9.2 consistently meets or exceeds MTTR target settings, with the exception of the 300 setting, but that was only off by about 6%. At 300, ORACLE startup took 318 seconds. We suspect that the 300 target was not met because this was the first test run, and Oracle MTTR statistics were estimates, not actual values. Subsequent values (600 and up) resulted in ORACLE startup times consistently under the target value. Note that all these measurements were done with 24 recovery processes and an 8 kilobytes message size.