An HP-UX 11i Performance Management Methodology
Although performance management and crisis management (including performance problem resolution) require different techniques and data collection, the same basic methodology can be used for both. Performance management uses the following steps:
- Assessment
- Measurement
- Interpretation and Analysis
- Identification of Bottlenecks
- Tuning or Upgrading the System
The flow chart in Figure 3-1 summarizes the performance methodology that will be discussed in detail, step by step.
Figure 3-1. Performance Management Methodology
3.1 Assessment
Assessment often involves asking lots of questions, as the following example shows.
From Bob's Consulting Log
I spent the entire morning asking people a lot of questions about the application, the system configuration, and how users perceived the performance problem. At lunch, the person who hired me said he was surprised that I had not yet logged onto the system. I told him my approach is to ask questions first and collect system data second. Later that day, I did obtain some data from the system, but the most important clue to the nature of the problem came from interviewing the users.
These questions are necessary to help you understand your limits as a performance professional, as well as those things that you can change. The following items must be learned during the assessment phase:
- System configuration
- Application design
- Performance expectations
- Known peak periods
- Changes in the system configuration or the application
- Duration of the problem, if applicable
- The options
- The politics
3.1.1 System Configuration
System configuration includes both hardware and software configuration. The number of disk drives, how data are distributed on the disks, whether the disks are used with mounted file systems or in raw mode, the file system parameters, and how the application makes use of them: all of these factors are examined during a system configuration assessment. Memory size, the amount of lockable memory, and the amount of swap space are scrutinized in assessing the virtual memory system configuration. You will need to know the processor type and the number of processors in the system. Finally, the kernel (operating system) configuration must be understood, and the values of tunable kernel parameters should be identified. Knowing all these items in advance will make carrying out the various performance management functions much easier.
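Much of this inventory can be captured up front and kept as a baseline for later comparison. The following is a minimal sketch, assuming a typical HP-UX 11i command set (command names and options, such as kmtune versus kctune, vary by release); it simply runs a list of read-only commands and saves their output:

```python
# config_snapshot.py: a minimal sketch for capturing a configuration
# baseline during assessment. Assumes a typical HP-UX 11i command set;
# names and options (e.g., kmtune vs. kctune) vary by release.
import subprocess

COMMANDS = [
    "uname -a",          # OS release and machine identity
    "model",             # processor/system model
    "ioscan -fnC disk",  # disk drives and their hardware paths
    "bdf",               # mounted file systems and space usage
    "swapinfo -tam",     # swap space configuration, in MB
    "kmtune",            # tunable kernel parameter values
]

def snapshot(outfile="config_baseline.txt"):
    with open(outfile, "w") as out:
        for cmd in COMMANDS:
            out.write("==== %s ====\n" % cmd)
            try:
                result = subprocess.run(cmd.split(), capture_output=True,
                                        text=True, timeout=60)
                out.write(result.stdout or result.stderr)
            except (OSError, subprocess.SubprocessError) as exc:
                out.write("unavailable: %s\n" % exc)  # not on this release
            out.write("\n")

if __name__ == "__main__":
    snapshot()
```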
3.1.2 Application Design
Understanding the design of the application is equally important. As a performance specialist, you may not be able to understand the design and workings of the application thoroughly. Ideally, there should be someone with whom you can discuss the application design.
Such things as inter-process communication (IPC) methods, basic algorithms, how the application accesses the various disks, and whether the application is compute- or I/O-intensive are examples of the knowledge you will need. For instance, with relational databases, it is important to understand whether the Relational Database Management System (RDBMS) software supports placing a table and the index that points to it on different disk drives. Some RDBMS products support this and some do not; when this capability is present, it is an important tuning technique for improving database performance. Modern design techniques, such as multi-threaded applications deployed on multi-processor systems, require new skills to analyze how the application works and how it uses system resources.
You should expect user complaints and comments to be phrased in terms of the application. However, measurements will be based upon the available metrics, which are mostly kernel-oriented. If you cannot translate system measurements into application-specific terms, it will be difficult to explain meaningfully what the measurements indicate about necessary changes to the system.
3.1.3 Performance Expectations
You will need to learn the performance expectations of the system's users. It is very important to know, in advance, the measurable criteria for satisfactory performance. This is the only way to know when performance tuning is successful and when it is time to monitor performance rather than actively attempt to improve it. Objective rather than subjective measures must be elicited. Being told that response time must be less than two seconds and is currently five seconds or more is much more useful than being told that performance is "lousy."
Understanding performance expectations is a complicated task. The perception of actual system performance and the definition of satisfactory performance will change depending upon one's perspective and role. Understanding expectations includes eliciting success criteria. In other words, you need to know whether the users or owners of the application and system will agree with you that you have finished tuning the system and application.
3.1.3.1 Response Time
Users of the system typically talk about poor response time. They are mostly concerned with how long it takes to get a response from the system or application once the enter key is pressed. For instance, a user who is entering invoices into an accounts payable system expects to have data verification and data insertion completed within several seconds, at most. Engineers who use a computer-aided design application expect to see the image of the part being analyzed rotated in real-time. Otherwise, they get frustrated with what they consider to be poor performance. Response time is the most commonly used measure of system performance and is typically quoted as one of the following:
- An average of a fractional second (sometimes called sub-second) or seconds
- A certain confidence level
- A maximum response time
Typical confidence levels are 90% to 98%. For example, one would say that the desirable average response time is two seconds, 95% of the time, with a maximum transaction response time of five seconds.
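Each of these figures is easy to compute from a set of measured transaction times. Here is a minimal sketch, using made-up sample data and the nearest-rank percentile method:

```python
# Summarize measured response times the three ways they are commonly
# quoted: the average, a confidence level (nearest-rank percentile),
# and the maximum. The sample data are illustrative only.
import math

def summarize(times, percentile=90.0):
    ordered = sorted(times)
    average = sum(ordered) / len(ordered)
    # nearest-rank method: the value at rank ceil(p/100 * n)
    rank = max(1, int(math.ceil(percentile / 100.0 * len(ordered))))
    return average, ordered[rank - 1], ordered[-1]

samples = [1.2, 0.8, 2.1, 1.7, 0.9, 4.8, 1.1, 1.6, 2.4, 1.3]
avg, p90, worst = summarize(samples)
print("average %.2f s, 90th percentile %.2f s, maximum %.2f s"
      % (avg, p90, worst))
```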
What is a good value for response time? While the automatic answer "It depends" is certainly true, it is useful to have a range of values for response time to use as a guideline. Users of a text editor or word processing package don't want to have to wait to see the characters echoed on the screen as they are typed. In this case, sub-second response time of approximately 250 milliseconds would be considered optimal. In a transaction processing environment one must understand the environment before developing good response time values. Some of the factors are as follows:
- Transaction complexity
- Number of users
- Think time between transactions
- Transaction types and ratios
Transaction Complexity
This deals with the amount of work that the system must perform to complete the transaction. If the transaction is defined as requiring three record lookups followed by an update, it is much more complex than a simple read. The complexity associated with typical Computer-Aided Design (CAD) applications is very high. Many CPU cycles and perhaps disk I/Os are necessary for simple interactive tasks, such as rotating a model of an object on the display.
Number of Users
The number of users influences the sizing of any system that is required to support a given workload and response time.
Think Time
As the think time between transactions increases, more work can be supported by the system in the idle periods between transactions. Heads-down environments, typical of data entry, provide almost no think time between transactions. Conversely, customer service environments often provide very long think times, since most of the time the customer service representative is speaking with the caller by telephone and casually accessing various databases as the conversation proceeds.
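The interplay of the number of users and think time can be made concrete with the interactive response-time law from operational analysis: with N concurrent users, a system throughput of X transactions per second, and an average think time of Z seconds, average response time is approximately R = N/X - Z. A sketch with illustrative numbers:

```python
# Interactive response-time law (operational analysis):
#   R = N / X - Z
# where N = concurrent users, X = system throughput (tx/sec),
# and Z = average think time between transactions (sec).
# The numbers below are illustrative only.

def response_time(users, throughput_tps, think_time_s):
    return users / throughput_tps - think_time_s

# A heads-down data-entry shop: little think time between transactions.
print(response_time(users=50, throughput_tps=20.0, think_time_s=2.0))   # 0.5 s

# A customer-service desk: long think times absorb the same user count
# at far lower throughput.
print(response_time(users=50, throughput_tps=2.0, think_time_s=20.0))   # 5.0 s
```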
Transaction Types and Ratios
It is necessary to look at the types of transactions and the ratio of each transaction type to the total. Read-intensive applications can provide rapid response times to queries. Insert- and particularly update-intensive applications require more CPU cycles, and often disk I/Os, to complete the transaction. Table 3-1 gives guidelines for acceptable response times.
Table 3-1. Typical Application Response Times (based on the author's experience)
| Transaction Type | Acceptable Response Time in Seconds |
| --- | --- |
| Interactive CAD applications | <1 |
| Text editing or word processing | 1/4 |
| Read-intensive, low complexity | <1 |
| Read-intensive, medium to high complexity | 1–2 |
| Update-intensive, low to medium complexity | 5 |
| Update-intensive, high complexity | 5–15 |
| Long think-time environments | 2–3 |
| Batch run | N/A |
Users perceive performance as poor when update response time exceeds 5 seconds and there is no preparation for the user to do before the next transaction. Read response time must be no more than 1–2 seconds to keep users satisfied.
After installing the computer system and the application, some system administrators or performance managers have been known to create dummy workloads on the systems before letting any users access the applications or the system. The initial users perceive a certain response time following their inputs. As more users are added, the dummy workload is reduced, thus providing a constant response time to the users. This trick attempts to address another issue with the perception of actual performance. Users prefer consistent responsiveness rather than variable responsiveness. If someone decides to run some resource-intensive batch jobs while interactive users are on the system, interactive performance will typically degrade. End-of-month processing will usually consume a very large amount of system resources, making it necessary to either keep the interactive users off the system while it is being run, or to run it in off-hours.
Users can tolerate consistently poor response time more readily than response time that is good one minute and poor the next. The acceptable variance in response time changes as the average response time itself changes. Users will tolerate an average response time of 1.5 seconds with a variance of ±1 second much less than they will tolerate an average response time of 3 seconds with a variance of ±0.25 second. One problem with attempting to prevent variability in performance is that when tuning favors throughput, the chance of experiencing variability in response time is greatly increased.
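One simple way to quantify consistency is the coefficient of variation of measured response times (standard deviation divided by the mean). A sketch with hypothetical samples approximating the two cases above:

```python
# Coefficient of variation (standard deviation / mean): a scale-free
# measure of response-time consistency; lower means more consistent.
import statistics

def consistency(times):
    return statistics.pstdev(times) / statistics.mean(times)

steady = [2.9, 3.1, 3.0, 2.8, 3.2, 3.0]   # ~3 s +/- 0.25 s
jumpy  = [0.6, 2.4, 1.1, 2.6, 0.8, 1.5]   # ~1.5 s +/- 1 s

print("steady: %.2f" % consistency(steady))   # ~0.04
print("jumpy:  %.2f" % consistency(jumpy))    # ~0.51
```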
Predictability is another way of looking at consistency of performance. Predictability makes the job of the performance professional easier when forecasting future resource consumption. It also allows the appropriate setting of expectations for performance, as the following analogy shows.
The public transportation department announces that buses on a particular bus route are scheduled to arrive at each stop an average of every ten minutes. In a given thirty-minute period, three buses arrive all at once and the next one arrives forty minutes later. This schedule meets the stated criteria. However, it will make the people waiting for the fourth bus very unhappy. It would be much better for the waiting passengers if the buses were to arrive consistently ten minutes apart. It would also be perceived well if the buses were to arrive predictably at certain clock times.
3.1.3.2 Throughput
Information system management personnel are typically interested in throughput, rather than response time. If the demands of the organization require that a certain amount of work must be processed in an average day, then management is concerned whether the system can process that workload, rather than caring whether response time is one second or two seconds. Throughput is often quoted as work per unit time. Examples of throughput measures are:
- Three thousand invoices processed per hour
- Five CAD modeling runs per day
- All end-of-month processing must complete within three days
- Overnight processing must complete in the nine off-peak hours
It is not possible to develop guidelines for good throughput values. Throughput is driven by business needs, and the system must be sized to support those requirements. Capacity planning is done to ensure that as business requirements grow, the system will be able to handle the workload. In theory, it should be easy to predict in advance whether a system will be able to provide a specified throughput: treat the workload as a batch workload and measure the average time to complete a unit of work with a simple benchmark.
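A minimal version of such a benchmark times a representative unit of work and converts the result into a throughput figure. In the sketch below, process_invoice() is a hypothetical stand-in for the real transaction:

```python
# Time a representative unit of work, then express the result as
# throughput. process_invoice() is a hypothetical stand-in for the
# real transaction being benchmarked.
import time

def benchmark(unit_of_work, iterations=100):
    start = time.perf_counter()
    for _ in range(iterations):
        unit_of_work()
    per_unit = (time.perf_counter() - start) / iterations
    return per_unit, 3600.0 / per_unit   # seconds per unit, units per hour

def process_invoice():
    pass   # hypothetical: replace with the real unit of work

per_unit, per_hour = benchmark(process_invoice)
print("%.4f s per invoice, roughly %.0f invoices per hour"
      % (per_unit, per_hour))
```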
In reality, the situation is never that simple. Users are typically very vocal, and poor response time often reaches the ears of management. The point of this discussion is that the definition of performance may change from person to person; attitudes about response time and throughput must be examined to determine what users consider to be acceptable performance. Although the overall methodology is the same, tuning for response time and for throughput are different. Another way of putting this is that there is one strategy for approaching performance, but there are many different tactics.
3.1.4 Known Peak Periods
It is useful to identify known peak periods in advance, so that unusual spikes in the data can be readily explained. For instance, it is often mentioned that resource utilization in an office environment peaks at 11:00 a.m. and between 2:00 and 3:00 p.m. during the normal work day. Processing requirements typically grow at the end of the month or the end of a quarter. Expecting peaks at these times can save time when analyzing the data. Additionally, if the peaks are absent, it may be a clue that something unusual is preventing full system utilization.
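Once the expected peaks are known, collected utilization data can be checked against them mechanically. A sketch using hypothetical hourly CPU-utilization samples:

```python
# Flag peak hours in hourly utilization data so that expected peaks
# (11:00 a.m., 2:00-3:00 p.m.) can be confirmed, and their absence
# investigated. The samples below are hypothetical.
THRESHOLD = 80.0   # percent utilization treated as a peak

hourly_cpu = {9: 45.0, 10: 62.0, 11: 88.0, 12: 50.0,
              13: 55.0, 14: 84.0, 15: 86.0, 16: 58.0}

peaks = [hour for hour, util in sorted(hourly_cpu.items())
         if util >= THRESHOLD]
print("peak hours:", peaks)   # [11, 14, 15] matches the expected pattern
```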
3.1.5 Sudden Changes
Anyone who has worked in technical support has experienced callers complaining that the system "suddenly" is no longer working correctly or that performance has suddenly degraded. The following is a familiar dialogue:
From Bob's Consulting Log
Consultant: Has anything changed in the application or the system?
Client: No, nothing has changed.
Consultant: Are you sure that nothing has changed?
Client: I'm quite sure nothing has changed.
(Three days later, after a lot of investigation ... )
Consultant: Did you notice that your database administrator (DBA) dropped some indexes?
Client: Oh! I didn't think those changes would make a difference.
The task of investigating changes that may have been made to the system is quite an art. However, the importance of this part of the assessment should not be minimized.
3.1.6 Duration of the Problem
When doing performance problem diagnosis, identifying the duration of the problem involves several issues:
- How long does the performance problem last?
- When does the performance problem occur?
- How long has the performance problem existed?
- When did it begin?
It is important to understand how long the problem lasts, to determine whether it occurs only occasionally, because of a spike in resource utilization, or constantly. Each of these situations requires a different tuning approach.
Knowing when the performance problem occurs means that data collection can be planned and minimized, as an alternative to collecting performance data for days or weeks to capture data when the problem manifests itself.
Finally, the length of time that the performance problem has existed influences the probability of determining if anything in the system or application has been changed. If the problem has existed for a long time (weeks or months), it is very unlikely that any changes will be discovered. One can also question the seriousness of the situation if the users have been living with the problem for months.
3.1.7 Understanding the Options
Understanding the options lets you determine what recommendations should be offered. If there is no capital budget for purchasing computer hardware, you can look for other ways to resolve the performance problem. Perhaps tuning the operating system or application is a viable alternative to upgrading the CPU to a faster model. In contrast, if time constraints dictate that the problem must be resolved quickly and deterministically, then upgrading the CPU to a faster model would probably be more expeditious than spending an unpredictable amount of time attempting to improve or redesign the application.
3.1.8 Understanding the Politics
It may be necessary to understand the politics of the organization before revealing the cause of the problem or before recommending the changes to be implemented. Knowledge of the politics may help narrow the scope of the question you are trying to answer. It may be that the user organization wants to gain more control of the system, and it is trying to gather evidence to support this cause. The Information Technology (IT) department (also called Information Systems in some organizations) may be trying to allocate computing resources fairly, and it may not be possible to make changes that improve the situation for only one group. Finally, you may have been called in simply to provide objective data to justify the purchase of a larger, faster system.