- Analysis and High-Level Observations
- Resolving CPU and I/O Bottlenecks Through Modeling and Capacity Planning
- Conclusions
- Recommendations
- I/O Infrastructure Performance Improvement Methodology
- Data Tables
Resolving CPU and I/O Bottlenecks Through Modeling and Capacity Planning
The following paragraphs discuss the activities that comprise the modeling and capacity planning process:
- Establish a Baseline System Performance Database
- Define the Peak Workload Period
- Build a Baseline Model of the System
- Perform "What If" Analysis
Establish a Baseline System Performance Database
The capacity planning process starts by collecting detailed performance data for the workloads on the systems under study. Workloads are logical groups of processes running on the system that represent different applications and/or groups of users. By defining workloads, you can view system performance data in terms that make sense from a business or functional perspective. Collect performance data for a period of time considered adequate to include the peak workload period of the system.
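As an illustration only (the case study does not show its collection tooling), the following Python sketch groups per-process CPU samples into named workloads so the performance database can be reported in business terms. The process-name prefixes and workload names are assumptions made up for the example.

```python
# Hypothetical sketch: group per-process CPU samples into named workloads.
from collections import defaultdict

# Assumed mapping from process-name prefixes to workload names.
WORKLOAD_RULES = {
    "ora_": "database",
    "httpd": "web",
    "batch": "batch_reports",
}

def workload_for(process_name: str) -> str:
    """Assign a process to a workload by prefix match; default to 'other'."""
    for prefix, workload in WORKLOAD_RULES.items():
        if process_name.startswith(prefix):
            return workload
    return "other"

def aggregate(samples):
    """Sum CPU seconds per workload for one collection interval.

    samples: iterable of (process_name, cpu_seconds) tuples.
    """
    totals = defaultdict(float)
    for name, cpu_seconds in samples:
        totals[workload_for(name)] += cpu_seconds
    return dict(totals)

if __name__ == "__main__":
    interval = [("ora_pmon", 4.2), ("httpd", 1.1), ("batch_rpt", 7.5), ("sshd", 0.2)]
    print(aggregate(interval))
    # {'database': 4.2, 'web': 1.1, 'batch_reports': 7.5, 'other': 0.2}
```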
Define the Peak Workload Period
After the performance data has been collected, you can analyze it to define the peak workload period. The criteria used to identify the peak period vary from system to system, depending on the types of workloads being run.
For many workload types, CPU utilization is the metric used to establish the peak period. However, for database servers, disk I/O may be the most important system activity to measure. For network file servers, network load may be the defining value. CPU utilization was used to define the peak workload period for the systems in this case study.
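To make this step concrete, the sketch below picks the peak hour from hourly CPU-utilization samples; for a database server, you might rank the hours by disk I/O rate instead. The sample layout and the one-hour window are assumptions for illustration, not the format of any particular measurement tool.

```python
# Hypothetical sketch: choose the peak period as the hour with the highest
# average CPU utilization in the collected performance data.
from statistics import mean

def peak_hour(samples_by_hour):
    """samples_by_hour: dict mapping an hour label to a list of
    CPU-utilization samples (0-100) collected during that hour."""
    return max(samples_by_hour, key=lambda hour: mean(samples_by_hour[hour]))

if __name__ == "__main__":
    data = {
        "09:00": [35, 42, 38],
        "10:00": [71, 78, 75],   # peak period in this made-up data
        "11:00": [60, 55, 58],
    }
    print(peak_hour(data))       # 10:00
```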
Build a Baseline Model of the System
Once the peak period is known, you can extract the performance data and system hardware configuration for that time interval to be used as input to the system model. The model is based on each resource in the system (treat each CPU, I/O controller, disk drive, and so forth as a service queue, and view the system as a network of these queues). Use the measured peak period performance data to calibrate the model, that is, use the measured data to find the coefficients of the queuing equations and thus define the system in mathematical terms.
When calibrated, the model represents the baseline system as it existed when you collected the performance data. The baseline model shows system resource utilization, throughput, response time, and queue delay (stretch) factors for each workload defined.
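The sketch below illustrates the queueing idea behind the baseline model for a single resource, using an open M/M/1 approximation. The real model is a calibrated network of such queues, and the service time and arrival rate here are illustrative, not measured values from the case study.

```python
# Hypothetical sketch: one modeled resource treated as a service queue.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    service_time: float   # seconds of service per request (from measured data)
    arrival_rate: float   # requests per second offered to the resource

    @property
    def utilization(self) -> float:
        return self.arrival_rate * self.service_time

    @property
    def response_time(self) -> float:
        """Service time plus queue time, R = S / (1 - U) for an M/M/1 queue."""
        u = self.utilization
        if u >= 1.0:
            raise ValueError(f"{self.name} is saturated (U = {u:.2f})")
        return self.service_time / (1.0 - u)

    @property
    def stretch_factor(self) -> float:
        """(Queue_Time + Service_Time) / Service_Time."""
        return self.response_time / self.service_time

if __name__ == "__main__":
    disk = Resource("disk_on_controller_4", service_time=0.008, arrival_rate=100.0)
    print(f"U = {disk.utilization:.2f}, R = {disk.response_time * 1000:.1f} ms, "
          f"stretch factor = {disk.stretch_factor:.2f}")
```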
Perform "What If" Analysis
With the baseline system model in hand, you can perform "what if" analysis, which projects workload growth and evaluates system configuration changes.
You can increase the amount of work the system is doing to see what the performance impact might be. You can also modify the system hardware configuration to determine the effect. Modification can include adding CPUs, increasing CPU speed, adding I/O controllers or changing their performance characteristics, adding disk drives, changing RAID characteristics, and so forth. In addition, you can combine workloads from different systems on a single system to see how they might perform together. System modeling is a powerful tool for capacity and performance planning, allowing any number of workload intensities and system configurations to be explored in advance of system utilization growth and equipment deployment.
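A minimal sketch of the "what if" step, again using a single M/M/1 queue as a stand-in for one modeled resource: grow the offered load, optionally split it across more devices, and recompute the projected stretch factor. The growth percentage, service time, and device count are assumptions chosen for illustration.

```python
# Hypothetical "what if" sketch: scale the offered load and redistribute it,
# then recompute the stretch factor for the busiest resource.
def stretch(service_time: float, arrival_rate: float) -> float:
    """(queue time + service time) / service time for an M/M/1 queue, 1 / (1 - U)."""
    u = arrival_rate * service_time
    if u >= 1.0:
        raise ValueError("resource saturated")
    return 1.0 / (1.0 - u)

if __name__ == "__main__":
    s, rate = 0.008, 100.0   # assumed baseline service time and I/O rate
    print(f"baseline           stretch = {stretch(s, rate):.2f}")
    print(f"+20% load          stretch = {stretch(s, rate * 1.20):.2f}")
    print(f"+20% load, 2 disks stretch = {stretch(s, rate * 1.20 / 2):.2f}")
```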
A set of four chart types presents the results of the "what if" analysis:
- Stretch factor
- Response time
- CPU utilization by workload
- Active resource
The "stretch factor" chart shows the amount of queuing/contention in the system for each workload. The minimum value of one means there is no queuing; that is the work is being performed as fast as it is being presented to the system. A value of two or above means that the workload is spending as much time or more waiting in a queue than it is being serviced by the system; this situation should be avoided.
Stretch Factor = (Queue_Time + Service_Time)/(Service_Time)
According to the preceding formula, if the Queue_Time is zero, the stretch factor is one. This should be the case in an ideally tuned system.
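For example, a request that waits 30 milliseconds in a queue and then receives 10 milliseconds of service has a stretch factor of (30 + 10)/10 = 4, well within the range that should be avoided.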
The "response time" chart shows "time to completion" requests, which can be theoretically associated and are directly proportional to the actual response time on characterizing the system load.
The "CPU utilization by workload" chart shows which workloads are associated with CPU utilization (user mode).
The "active resource" chart shows utilization percentages for I/O resources.
Each set of charts shows the system configuration running the baseline production workloads during the peak hour. The charts also show multiples of the baseline workload up to the point where the configuration fails to provide adequate capacity. Additional sets of charts for each system show improved hardware configurations that address the deficiencies found in the current production systems.
For the systems in the case study, physical memory size was not an issue. This statement ignores any application memory leak problems. It does, however, take into account the memory upgrades already scheduled for the production systems.
For the systems in the study, network bandwidth should not be an issue if the additional network interfaces are provided on each system, as recommended in the following paragraphs.
Baseline Charts
The following charts establish the baseline for the "what if" analysis in this case study.
Observations
The chart in the preceding figure shows that the stretch factor for the user workload is approximately 6.56, about six times the acceptable level. Further analysis verified that the following I/O devices are the primary contributors to the queue delay:
Controller 4: 4DDEd1, 03F84d1
Controller 13: 36E5d0, 54A8d0, 5762d1
Controller 10: 4405d0, DFE6d0
Controller 7: 3E36d1, 5A19d1, AA3d0
CPU utilization in user mode is low, under 20 percent, reflecting the high stretch factor described previously.
Exercising the Model
The following charts project a 20 percent growth for three periods.
Observations
Simulating 20 percent compound increases in user workload for three periods increased the stretch factor and corresponding response time progressively for the next three periods.
|                | Baseline (BL) | BL + 20%   | BL + 44%   | BL + 72.8% |
|----------------|---------------|------------|------------|------------|
| Stretch factor | 6.56          | 8.01       | 9.96       | 12.34      |
| Response time  | 1382          | 1697 (22%) | 2097 (52%) | 2598 (88%) |
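A quick arithmetic check of the load columns in the preceding table: three successive 20 percent increases compound to 20, 44, and 72.8 percent over the baseline, as the short loop below confirms. The modeled response time grows faster than the load (22, 52, and 88 percent) because queueing delay is nonlinear in utilization.

```python
# Three successive 20 percent increases compound multiplicatively.
for period in range(1, 4):
    growth = 1.20 ** period - 1
    print(f"period {period}: load = baseline + {growth:.1%}")
```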
Adding CPUs to the Current Configuration
The following charts show the results of adding CPUs to the current configuration.
Observations
Simulating the addition of four CPUs to the current environment showed that adding CPUs has no significant impact on the response time.
|                | Baseline | Adding four CPUs |
|----------------|----------|------------------|
| Stretch factor | 6.56     | 6.51             |
| Response time  | 1382     | 1383             |
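This outcome is what a queueing view predicts when the bottleneck is I/O rather than CPU. The sketch below, with illustrative numbers rather than case-study measurements, models one CPU stage and one disk stage in series and shows that doubling the number of CPUs leaves the end-to-end response time essentially unchanged while the disk queue dominates.

```python
# Hypothetical sketch: a CPU stage and a disk stage in series. Doubling the
# CPUs barely changes the total because the disk queue dominates.
def residence_time(service_time, arrival_rate, servers=1):
    """Approximate residence time at a queue: S / (1 - U), with
    U = arrival_rate * service_time / servers."""
    u = arrival_rate * service_time / servers
    if u >= 1.0:
        raise ValueError("stage saturated")
    return service_time / (1.0 - u)

rate = 100.0                   # requests per second (assumed)
cpu_s, disk_s = 0.002, 0.008   # CPU and disk service time per request (assumed)

for cpus in (4, 8):
    total = residence_time(cpu_s, rate, servers=cpus) + residence_time(disk_s, rate)
    print(f"{cpus} CPUs: total response time = {total * 1000:.1f} ms")
```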
Removing CPUs From the Current Configuration
The following charts show the results of removing CPUs from the current configuration.
Observations
Simulating the removal of four CPUs from the current environment showed that removing CPUs has no significant impact on the stretch factor or response time.
|                | Baseline | Removing four CPUs |
|----------------|----------|--------------------|
| Stretch factor | 6.56     | 6.59               |
| Response time  | 1382     | 1382               |
Balancing the I/O
The following charts show the result of balancing the I/O.
Observations
Balancing the I/O among the existing devices reduced the stretch factor and response time by one half. Even so, the stretch factor remained about three times the acceptable level. The next test evaluated the behavior of the model after adding controllers and disk drives and further distributing the I/O. First, one additional Fibre Channel controller with seven 73-gigabyte 10,000-RPM disk drives was configured, and the I/O was balanced among the new controller and drives. The stretch factor and response time dropped further but were still about twice the acceptable level. Finally, a second additional controller with the same specifications as the first was configured. This configuration brought the stretch factor and response time within the acceptable level.
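The sketch below illustrates why balancing helps so much, using the same single-queue approximation as the earlier sketches: spreading the same total I/O rate evenly across the controllers lowers the utilization of the hottest device, and queueing delay falls sharply because it is nonlinear in utilization. The service time and per-controller rates are assumptions, not measured values from the case study.

```python
# Hypothetical sketch: the effect of balancing the same total I/O rate
# evenly across four controllers on the worst-case stretch factor.
def stretch(service_time: float, arrival_rate: float) -> float:
    """(queue time + service time) / service time for an M/M/1 queue."""
    u = arrival_rate * service_time
    return float("inf") if u >= 1.0 else 1.0 / (1.0 - u)

service_time = 0.008                    # seconds per I/O (assumed)
unbalanced = [110.0, 15.0, 10.0, 5.0]   # I/Os per second on four controllers
balanced = [sum(unbalanced) / len(unbalanced)] * len(unbalanced)

for label, rates in (("unbalanced", unbalanced), ("balanced", balanced)):
    worst = max(stretch(service_time, r) for r in rates)
    print(f"{label:10s} worst-controller stretch factor = {worst:.2f}")
```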
Projecting 20 Percent Growth for Three Periods After Balancing the I/O
The following charts show the result of projecting 20 percent growth after balancing the I/O.
Observations
Simulating 20 percent compound increases in user workload for three periods increased the stretch factor and corresponding response time progressively for the next three periods. The response time figures were within the acceptable level.
|                | Baseline (BL) | BL + 20% | BL + 44%  | BL + 72.8% |
|----------------|---------------|----------|-----------|------------|
| Stretch factor | 1.28          | 1.34     | 1.43      | 1.55       |
| Response time  | 273           | 286 (5%) | 303 (11%) | 330 (21%)  |
Adding CPUs After Balancing the I/O
The following charts show the results of adding CPUs after balancing the I/O.
Observations
Adding CPUs after balancing the I/O has no impact on the stretch factor or response time. This confirms the accuracy of the model, which correctly reflects a situation where CPUs are added to an already balanced system.
|                | Baseline | Adding four CPUs |
|----------------|----------|------------------|
| Stretch factor | 1.28     | 1.28             |
| Response time  | 273      | 275              |
Removing CPUs After Balancing the I/O
The following charts show the results of removing CPUs from the current configuration after balancing the I/O.
Observations
Removing CPUs after balancing the I/O has no impact on the stretch factor or response time. However, user-mode CPU utilization increases substantially, suggesting that the configuration is approaching the lower limit on the number of CPUs for the current load.
|                | Baseline | Removing four CPUs |
|----------------|----------|--------------------|
| Stretch factor | 1.28     | 1.29               |
| Response time  | 273      | 272                |