Conclusions
TeamQuest reports 1300 IOPS on the system during the peak period. Tests with VERITAS VxBench indicate that the I/O subsystem can sustain more than 11,000 IOPS. Based on this information, about 12 percent of the total I/O subsystem capacity is utilized during the peak period, so the underlying architecture is neither the limiting nor the contributing factor for the high I/O CPU utilization. This corresponds with the observation that a few controllers have a very high utilization rate during the peak period.
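The 12 percent figure follows directly from the two measurements quoted above. The following minimal sketch reproduces the arithmetic; the variable names are illustrative only, and because the VxBench result is a lower bound ("more than 11,000 IOPS"), the computed utilization is an upper bound.

```python
# Figures quoted in the text; variable names are illustrative only.
peak_iops = 1300             # TeamQuest measurement during the peak period
sustained_capacity = 11_000  # VxBench result, a lower bound ("more than 11,000")

# Because the capacity is a lower bound, this ratio is an upper bound.
utilization = peak_iops / sustained_capacity
print(f"Peak I/O subsystem utilization: at most {utilization:.1%}")
```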
The capacity planning model indicates that the database server is running above the acceptable limits of resource contention. According to the model, the infrastructure could deliver about four times better response time if the load on the I/O subsystem were alleviated, distributed over more I/O controllers and devices, or both.
The following table shows that increasing the workload with the current infrastructure and application environment further degrades the response time.
| | Baseline (BL) | BL + 20% | BL + 44% | BL + 72.8% |
|---|---|---|---|---|
| Stretch factor | 6.56 | 8.01 | 9.96 | 12.34 |
| Response time | 1382 | 1697 (22%) | 2097 (52%) | 2598 (88%) |
In fact, response time degrades faster than the workload grows, and the gap widens as the workload increases. Several simulations were performed on the capacity planning model to balance the I/O subsystem. Balancing the I/O among the existing devices reduces the stretch factor and response time by one half, which is considered a very good achievement.
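The relationship between workload growth and response time in the first table can be checked with a short calculation. The sketch below assumes the workload columns represent three successive 20 percent growth steps (1.2, 1.44, and 1.728 times the baseline) and compares each workload increase with the corresponding increase in response time. The computed percentages agree with those printed in the table to within about one percentage point.

```python
# Response times from the first table (current, unbalanced configuration).
response_times = [1382, 1697, 2097, 2598]
growth_per_period = 0.20  # assumed: each column is one more 20% growth step

workload_increases = []
response_increases = []
for period, response in enumerate(response_times):
    # Compounded workload growth relative to the baseline.
    workload_increases.append((1 + growth_per_period) ** period - 1)
    # Response time growth relative to the baseline.
    response_increases.append(response / response_times[0] - 1)

for w, r in zip(workload_increases, response_increases):
    print(f"workload +{w:.1%}  response time +{r:.1%}  gap {r - w:.1%}")
```

In every growth period the response time increase exceeds the workload increase, and the gap between the two widens with each step.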
To further reduce the stretch factor, the behavior of the model after adding new controllers with disk drives was evaluated. First, one additional Fibre Channel controller with seven 73-gigabyte 10,000-RPM disk drives was configured, and the I/O was balanced among the new controller and drives. The stretch factor and response time went down further, but were still twice the acceptable level. Finally, a second additional controller with the same specifications as the first one was configured. This configuration put the stretch factor and response time within the acceptable level.
The following table shows that, after balancing the I/O, the system stretch factor drops below the recommended thresholds and the response time grows only moderately as the workload increases.
| | Baseline (BL) | BL + 20% | BL + 44% | BL + 72.8% |
|---|---|---|---|---|
| Stretch factor | 1.28 | 1.34 | 1.43 | 1.55 |
| Response time | 273 | 286 (5%) | 303 (11%) | 330 (21%) |
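The two tables can be cross-checked against each other. The sketch below assumes that the stretch factor is defined as response time divided by the contention-free service time; this definition is an assumption and is not stated in this section. Under it, dividing each response time by its stretch factor should give roughly the same service time in every scenario.

```python
# (stretch factor, response time) pairs taken from the two tables.
before_balancing = [(6.56, 1382), (8.01, 1697), (9.96, 2097), (12.34, 2598)]
after_balancing = [(1.28, 273), (1.34, 286), (1.43, 303), (1.55, 330)]

def implied_service_times(rows):
    # Assumed definition: stretch factor = response time / service time,
    # so service time = response time / stretch factor.
    return [response / stretch for stretch, response in rows]

service_before = implied_service_times(before_balancing)
service_after = implied_service_times(after_balancing)

print("before balancing:", [round(s, 1) for s in service_before])
print("after balancing: ", [round(s, 1) for s in service_after])
```

All eight implied service times fall in a narrow band between 210 and 214 (in the same units as the response times), which is consistent with the assumed definition: the improvement comes from reduced queuing, not from faster service.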
In summary, although substantial performance improvement can be achieved by balancing the I/O among existing devices, complete optimization based on stretch factor analysis was achieved only by adding devices and then balancing the I/O across them. With the I/O balanced and the stretch factor optimized, the infrastructure could support a 20 percent workload increase in each of the next three growth periods, with response times well below those currently observed.
According to the model projections, adding or removing some CPUs would have minimal impact on performance.
Adding other resources to the environment was also simulated but, according to the model, application performance can improve only after the I/O resource contention is addressed.
One final conclusion: after implementing the simulated results and suggestions in this report, the customer was able to increase IOPS from 1300 to 4800 and improve response time and throughput considerably.