Measurement Granularity
The SLA must describe the granularity of the measurements. There are three related parts to that granularity: the scope, the sampling frequency, and the aggregation interval.
Measurement Scope
The first consideration is the scope of the measurement, and availability metrics provide an excellent example. Many providers define the availability of their services as an overall average of availability across all access points. This approach gives the service provider the most flexibility and cushion for meeting negotiated levels.
Suppose your company had 100 sites and a target of 99 percent availability based on an overall average. Ninety-nine of your sites could have complete availability (100 percent) while one had none (0 percent), and the overall average would still meet the target. A site with an extended period of complete unavailability isn't usually acceptable, but the service provider has complied with the negotiated terms of the SLA.
If the availability level had instead been specified on a per-site basis, the provider would have been noncompliant, and appropriate actions would follow in the form of penalties or lost customers. The same principle applies when measuring the availability of multiple sites, servers, or other units.
Availability has an additional scope dimension, in addition to breadth: the depth to which the end user can penetrate to the desired service. To use a telephone analogy, is dial tone sufficient, or must the end user be able to reach specific numbers? In other words, which transactions must be accessible for the system to be regarded as available?
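For example, a minimal sketch of a depth-aware availability probe (the URL and the expected response text below are purely illustrative, not from the text) might require a specific transaction to succeed rather than merely accepting a response from the server:

```python
from urllib import request, error

def is_available(url, must_contain=None, timeout=5):
    """Depth-aware availability probe: reaching the server at all ("dial tone")
    is not enough; the response must also contain evidence that the required
    transaction succeeded."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return False
            body = resp.read().decode(errors="replace")
    except (error.URLError, TimeoutError):
        return False
    return must_contain is None or must_contain in body

# Hypothetical transaction: the order-status page must render a confirmation string
print(is_available("https://example.com/order-status?id=123",
                   must_contain="Order status"))
```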
Scope issues for performance metrics are similar to those for the availability metric. There may be different sets of metrics for different groups of transactions, different times of day, and different groups of end users. Some transactions may be unusually important to particular groups of end users at particular times and completely unimportant at other times.
Regardless of the scope selected for a given individual metric, it's important to realize that executive management will want these various metrics aggregated into a single measure of overall performance. Derivation of that aggregated metric must be addressed during measurement definition.
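The SLA itself need not dictate how that roll-up is performed, but as an illustration, a simple weighted average of normalized per-metric attainment scores is one possible derivation. The metric names, scores, and weights below are hypothetical:

```python
# Illustrative roll-up of individual SLA metrics into one executive-level score.
results = {
    "availability":  {"attainment": 0.995, "weight": 0.5},
    "response_time": {"attainment": 0.92,  "weight": 0.3},
    "throughput":    {"attainment": 0.97,  "weight": 0.2},
}

# Weighted average of the individual attainment scores
overall = sum(m["attainment"] * m["weight"] for m in results.values())
print(f"Overall SLA attainment: {overall:.1%}")
```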
Measurement Sampling Frequency
A shorter sampling interval (more frequent measurement) catches problems sooner at the expense of consuming additional network, server, and application resources. Longer intervals between measurements reduce those impacts while possibly missing important changes, or at least not detecting them as quickly as a shorter interval would. Customers and service providers need to negotiate the measurement interval because it affects the cost of the service to some extent.
Statisticians recommend that sampling be random because it avoids accidental synchronization with underlying processes and the resulting distortion of the metric. Random sampling also helps discover brief patterns of poor performance; consecutive bad results are more meaningful than individual, spaced-out difficulties.
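As an illustration (not a prescription from the text), a randomized schedule can be generated by drawing exponentially distributed gaps between probes, one common way to avoid locking onto periodic behavior in the system being measured:

```python
import random

def randomized_schedule(mean_gap_s, duration_s, seed=None):
    """Measurement times with exponentially distributed gaps (a Poisson
    process), so probes never synchronize with periodic behavior in the
    system being measured."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mean_gap_s)   # average gap of mean_gap_s seconds
        if t >= duration_s:
            return times
        times.append(t)

# One day of probes averaging one measurement every five minutes
probe_times = randomized_schedule(mean_gap_s=300, duration_s=86_400, seed=1)
print(len(probe_times), "probes scheduled")
```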
Confidence interval calculations can be used to help determine the sampling frequency. Although it is impossible to perform an infinite number of measurements, it is possible to calculate a range of values that we're reasonably sure would contain the true summary values (median, average, and so on) if we could have performed an infinite number of measurements. For example, you might want to be able to say the following: "There's a 95 percent chance that the true median, if we could perform an infinite number of measurements, would be between five seconds and seven seconds." That is what the "95 Percent Confidence Interval" seeks to estimate, as shown in Figure 2-4. When you take more measurements, the confidence interval (two seconds in this example) usually becomes narrower. Therefore, confidence intervals can be used to help estimate how many measurements you need to obtain a given level of precision with statistical confidence.
Figure 2-4 Confidence Interval for Internet Data
There are simple techniques for calculating confidence intervals for "normal distributions" of data (the familiar bell-shaped curve). Unfortunately, as discussed in the subsequent section on statistical analysis, Internet distributions are so different from the "normal distribution" that these techniques cannot be used. Instead, the statistical simulation technique known as "bootstrapping" can be used for these calculations on Internet distributions.
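As a rough sketch of the idea (using NumPy and a plain percentile bootstrap, rather than any particular vendor's implementation), a confidence interval for the median of heavy-tailed response-time data could be estimated like this; the simulated lognormal data merely stands in for real measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(samples, stat=np.median, n_resamples=10_000, confidence=0.95):
    """Percentile-bootstrap confidence interval for any summary statistic."""
    samples = np.asarray(samples)
    # Resample the data with replacement and recompute the statistic each time
    idx = rng.integers(0, len(samples), size=(n_resamples, len(samples)))
    stats = stat(samples[idx], axis=1)
    lower = np.percentile(stats, (1 - confidence) / 2 * 100)
    upper = np.percentile(stats, (1 + confidence) / 2 * 100)
    return lower, upper

# Simulated heavy-tailed response times (seconds) standing in for real data
times = rng.lognormal(mean=1.7, sigma=0.5, size=200)
low, high = bootstrap_ci(times, stat=np.median)
print(f"95% confidence interval for the median: {low:.1f} s to {high:.1f} s")
```

With more measurements (a larger sample), the resampled statistics cluster more tightly and the reported interval narrows, which is exactly the behavior described for Figure 2-4.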
In some cases, depending on the pattern of measurements, simple approximations for calculating confidence intervals may be used. Keynote Systems recommends the following approximation for the confidence interval of availability metrics. (This information is drawn from "Keynote Data Accuracy and Statistical Analysis for Performance Trending and Service Level Management," Keynote Systems Inc., San Mateo, California, 2002.) The procedure is as follows:
Omit data points that indicate measurement problems instead of availability problems.
Calculate a preliminary estimate of the 95 percent confidence interval for average availability (avg) of a measurement sample with n valid data points:
Preliminary 95 Percent Confidence Interval = avg ± (1.96 * square root [(avg * (1 - avg)) / (n - 1)])

For example, with a sample size n of 100, if 12 percent of the valid measurements are errors, the average availability is 88 percent. The formula gives a confidence interval of (0.82, 0.94). This suggests that there's a 95 percent probability that the true average availability, if we'd miraculously taken an infinite number of measurements, is between 82 and 94 percent. Notice that even with 100 measurements, this confidence interval leaves much room for uncertainty! To narrow that band, you need more valid measurements (a larger n, such as 1000 data points).

Now you must decide if the preliminary calculation is reasonable. We suggest that the preliminary calculation should be accepted only if the upper limit is below 100 percent and the lower limit is above 0 percent. (The example just used gives an upper limit greater than 100 percent for n = 29 or fewer, so this rule suggests that the calculation is reasonable only if n is 30 or greater.)
Note that we're not saying that the confidence interval is too wide if the upper limit is above 100 percent (or if the average availability itself is 100 percent because no errors were detected); we're saying that you don't know what the confidence interval is. The reason is that the simplifying assumptions you used to construct the calculation break down if there are not enough data points.
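The approximation is simple enough to verify directly. A minimal sketch that reproduces the n = 100, 88 percent availability example from the text:

```python
import math

def availability_ci(avg, n):
    """Preliminary 95 percent confidence interval for average availability.

    avg -- average availability over the n valid data points (0.0 to 1.0)
    n   -- number of valid (non-error) data points
    Accept the result only if the lower limit is above 0 and the upper
    limit is below 1.
    """
    half_width = 1.96 * math.sqrt(avg * (1 - avg) / (n - 1))
    return avg - half_width, avg + half_width

# The example from the text: 100 valid measurements, 12 percent errors
lower, upper = availability_ci(avg=0.88, n=100)
print(f"({lower:.2f}, {upper:.2f})")   # approximately (0.82, 0.94)
```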
For performance metrics, a simple solution to the problem of confidence intervals is to use geometric means and "geometric deviations" as measures of performance, which are described in the subsequent section in this chapter on statistical analysis.
Keynote Systems suggests, in the paper previously cited, that you can approximate the 95 Percent Confidence Interval for the geometric mean as follows, for a measurement sample with n valid (nonerror) data points:
Upper Limit = [geometric mean] * [(geometric deviation) ^ (1.96 / square root of [n - 1])]
Lower Limit = [geometric mean] / [(geometric deviation) ^ (1.96 / square root of [n - 1])]
This is similar to the use of the standard deviation with normally distributed data and can be used as a rough approximation of confidence intervals for performance measurements. Note that this ignores cyclic variations, such as by time of day or day of week; it is also somewhat distorted because even the logarithms of the original data are asymmetrically distributed, sometimes with a skew greater than 3. Nevertheless, the errors encountered using this recipe are much less than those that result from the usual use of mean and standard deviation.
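A sketch of that recipe, computing the geometric mean and geometric deviation from the logarithms of the data and then applying the limits above (the sample values are hypothetical):

```python
import math

def geometric_stats(samples):
    """Geometric mean and geometric deviation of positive measurements."""
    logs = [math.log(x) for x in samples]
    n = len(logs)
    mean_log = sum(logs) / n
    # Sample standard deviation of the logarithms (n - 1 in the denominator)
    sd_log = math.sqrt(sum((v - mean_log) ** 2 for v in logs) / (n - 1))
    return math.exp(mean_log), math.exp(sd_log)

def geometric_mean_ci(samples):
    """Approximate 95 percent confidence interval for the geometric mean."""
    n = len(samples)
    gmean, gdev = geometric_stats(samples)
    factor = gdev ** (1.96 / math.sqrt(n - 1))
    return gmean / factor, gmean * factor

# Hypothetical response-time sample (seconds), skewed like Internet data
sample = [4.1, 5.7, 3.9, 12.4, 6.3, 5.1, 4.8, 22.0, 5.5, 7.2]
print(geometric_mean_ci(sample))
```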
Measurement Aggregation Interval
The time interval over which availability and performance measurements are aggregated must also be selected. Generally, providers and customers agree upon time spans ranging from a week to a month. These are practical intervals because they tend to hide small fluctuations and irrelevant outlying measurements while still enabling reasonably prompt analysis and response. Longer intervals allow longer problem periods before the SLA is violated.
Table 2-2 shows this idea. If availability is measured on a small scale (hourly), a high-availability requirement such as "five nines" (99.999 percent) permits only 0.036 seconds of outage before there's a breach of the SLA. Providers must provision adequate redundancy to meet this type of stringent requirement, and clearly they will pass those costs on to the customers who demand such high availability.
Table 2-2 Measurement Aggregation Intervals for Availability
Availability Percentage | Allowable Outage (Hour) | Allowable Outage (Day) | Allowable Outage (Week) | Allowable Outage (4 Weeks)
98%     | 1.2 min   | 28.8 min  | 3.36 hr  | 13.4 hr
98.5%   | 0.9 min   | 21.6 min  | 2.52 hr  | 10 hr
99%     | 0.6 min   | 14.4 min  | 1.68 hr  | 6.7 hr
99.5%   | 0.3 min   | 7.2 min   | 50.4 min | 3.36 hr
99.9%   | 3.6 sec   | 1.44 min  | 10 min   | 40 min
99.99%  | 0.36 sec  | 8.64 sec  | 1 min    | 4 min
99.999% | 0.036 sec | 0.864 sec | 6 sec    | 24 sec
If a monthly (four-week) aggregation interval is chosen, the 99.999 percent level permits a cumulative outage of 24 seconds per month while remaining in compliance. A 99.9 percent availability level permits up to 40 minutes of accumulated downtime for a service each month. Many providers are still trying to negotiate SLAs with availability levels ranging from 98 to 99.5 percent, or cumulative downtimes of roughly 13.4 hours down to 3.4 hours each month.
Note that these values assume 24 x 7 x 365 operation. For operations that do not require round-the-clock availability, are not up during weekends, or have scheduled maintenance periods, the values will change. That said, they're pretty easy to compute.
The key is for the service provider and the service customer to set a common definition of the critical time interval. Because longer aggregation intervals permit longer periods during which metrics may be outside tolerance, many organizations must look more deeply at their aggregation definitions and at their tolerance for service interruption. A 98 percent availability level may be adequate and economically acceptable, but how would the business function if the 13.4 allotted hours of downtime per month occurred in a single outage? Could the business tolerate an interruption of that length without serious damage? If not, then another metric that limits the length of any single interruption must be incorporated. This could be expressed in a statement such as the following: "Monthly availability at all sites shall be 98 percent or higher, and no service outage shall exceed three minutes." In other words, a little arithmetic to evaluate scenarios for compliance goes a long way.
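That arithmetic is easy to automate. A minimal sketch that reproduces the allowable-outage figures in Table 2-2 and then checks a hypothetical month of outages against both clauses of the example SLA statement (the outage durations below are made up for illustration):

```python
INTERVAL_SECONDS = {"hour": 3_600, "day": 86_400, "week": 604_800, "4 weeks": 2_419_200}

def allowable_outage_s(availability_pct, interval="4 weeks"):
    """Maximum cumulative outage, in seconds, that still meets the target
    (assumes 24 x 7 x 365 operation, as in Table 2-2)."""
    return INTERVAL_SECONDS[interval] * (1 - availability_pct / 100)

# Reproduce the 98 percent row of Table 2-2 (in seconds)
for interval in INTERVAL_SECONDS:
    print(f"98% target, {interval} interval: "
          f"{allowable_outage_s(98, interval):,.1f} s allowable outage")

# Hypothetical compliance check: monthly availability plus a per-outage cap
outages_s = [300, 1_200, 7_200]     # individual outages this month, in seconds
max_single_outage_s = 3 * 60        # "no service outage shall exceed three minutes"
compliant = (sum(outages_s) <= allowable_outage_s(98, "4 weeks")
             and max(outages_s) <= max_single_outage_s)
print("Compliant this month:", compliant)
```

In this scenario the cumulative downtime is well within the 98 percent monthly allowance, yet the two-hour outage violates the per-outage clause, which is exactly the kind of exposure the combined statement is meant to catch.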