Summary
This chapter lays the groundwork for the chapters that follow, which discuss different aspects of actually managing services. Successful service management is predicated on delivering acceptable service quality at acceptable price points and within acceptable time frames. Handled correctly, it improves service quality, improves relationships with suppliers, and may even lower total costs.
The SLA is the basic tool used to define acceptable quality and any relationships between quality and price. It is a formal, negotiated contract between a service provider and a service user that defines the services to be provided, the service quality goals (often called service level indicators and service level objectives), and the actions to be taken if the service provider does not comply with the SLA terms.
Measurement is a key part of an SLA, and most SLAs have two classes of metrics: technical metrics and business process metrics. Technical metrics include both high-level technical metrics, such as the success rate of an entire transaction as seen by an end user, and low-level technical metrics, such as the error rate of an underlying communications network. Business process metrics include measures of the provider's business practices, such as the speed with which it responds to problem reports. Metrics should also include measures of the expected workload. Service providers may package the metrics into specific profiles that suit common customer requirements while simplifying the process of selecting and specifying the parameters.
In any case, a properly constructed SLA is based on metrics that are relevant to the end-user experience. Many of the low-level technical metrics, such as communications packet loss, have complex relationships to end-user experience; it's usually much better to use high-level technical metrics that directly measure end-user experience, such as web page download time and transaction time. The low-level technical metrics can then be derived from the high-level technical metrics and used to manage subordinate systems.
SLA metrics must be carefully defined in terms of scope, sampling frequency, and aggregation interval:
Scope represents the breadth of measurement (for example, the number of test points from which availability is measured and the percentage of them that must be unavailable for the entire system to be marked as unavailable).
Measurement sampling should be random, and the sampling frequency should be chosen to provide timely alerts when problems occur and to provide the appropriate confidence intervals for availability and performance measurement. Calculation of confidence intervals is unfortunately complex for Internet statistics, as the usual formulas, suited for normal distributions, cannot be used. Instead, statistical simulation through bootstrapping or the approximations discussed in the body of this chapter can provide estimates of the number of measurements needed to provide reasonable statistics.
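The aggregation interval is also important, as the longer intervals often chosen in SLAs may allow long periods of sub-par performance. For example, a 99.9 percent availability objective aggregated over a 30-day month still permits a single continuous outage of about 43 minutes. The tolerance for service interruption then becomes important and may need to be separately specified.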
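The bootstrapping approach mentioned above for confidence intervals can be sketched in a few lines. This is a minimal illustration, not the chapter's own procedure: it uses the percentile bootstrap on the median, the function name `bootstrap_ci` and its parameters are hypothetical, and the download times are synthetic values drawn from a long-tailed distribution purely for demonstration.

```python
import random
import statistics


def bootstrap_ci(samples, stat=statistics.median, n_resamples=2000,
                 level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for stat(samples).

    Resamples the measurements with replacement, recomputes the statistic
    on each resample, and reads the interval off the sorted estimates.
    No assumption of a normal distribution is made.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(samples)
    estimates = sorted(
        stat(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    # Number of estimates to cut from each tail (e.g. 2.5% for level=0.95).
    tail = round((1 - level) / 2 * n_resamples)
    return estimates[tail], estimates[n_resamples - 1 - tail]


# Synthetic, long-tailed "page download times" in seconds (assumed data).
data_rng = random.Random(42)
times = [data_rng.lognormvariate(0.0, 1.0) for _ in range(200)]

low, high = bootstrap_ci(times)
print(f"median = {statistics.median(times):.2f} s, "
      f"95% CI = [{low:.2f}, {high:.2f}] s")
```

Rerunning the sketch with fewer measurements widens the interval, which gives a rough way to judge how many samples a given sampling frequency must supply per aggregation interval.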
Measurements must also be validated and subjected to statistical treatment when used in SLAs, and the methods for that validation and treatment must be documented in the SLA to avoid dispute. Validation ensures that erroneous measurements are removed, insofar as is possible, before computation of the metrics used in the SLA. Statistical treatment ensures that outlying measurements do not create a misleading picture of the performance as perceived by end users, with the resulting waste of resources spent fixing what may be a minor issue. Arithmetic averages and standard deviations should not be used to handle Internet statistics.
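To see why the arithmetic average can mislead, consider the following small sketch. The response times and the 2-second objective are invented for illustration, and the median is used here only as one common robust alternative; the SLA itself should name whichever statistical treatment the parties agree on.

```python
import statistics

# Hypothetical response times in seconds; the last sample hit a timeout.
times = [0.8, 0.9, 1.0, 1.1, 1.2, 0.9, 1.0, 1.1, 1.0, 30.0]

OBJECTIVE_SECONDS = 2.0  # assumed service level objective

mean = statistics.mean(times)      # pulled far upward by the single outlier
median = statistics.median(times)  # reflects what a typical user saw
# Fraction of measurements that missed the objective.
slow_fraction = sum(t > OBJECTIVE_SECONDS for t in times) / len(times)

print(f"mean = {mean:.2f} s")
print(f"median = {median:.2f} s")
print(f"fraction over objective = {slow_fraction:.0%}")
```

Here nine of ten users saw roughly one second, yet the mean of about 3.9 seconds suggests that the typical experience missed the objective. Reporting the median together with the fraction of slow measurements describes the same data without that distortion.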
Finally, the SLA should be written with penalty and reward clauses that are sufficient to inspire the performance the customer wants, and the goals should be set to ensure that the motivating quality of the SLA remains throughout the time period. Capped penalties or goals are examples of techniques that may motivate a supplier to abandon work on an account just because the cap has been reached, which is probably not the desired behavior.
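The effect of a cap on supplier motivation can be shown with simple arithmetic. In this sketch the function `monthly_penalty`, the penalty rate, and the cap are all hypothetical values chosen for illustration; real SLAs structure penalties in many different ways.

```python
def monthly_penalty(missed_minutes, rate_per_minute=10.0, cap=None):
    """Penalty owed for minutes of SLA violation in one month.

    rate_per_minute and cap are illustrative values, not taken from any
    real contract. With cap=None the penalty grows without limit.
    """
    penalty = missed_minutes * rate_per_minute
    return penalty if cap is None else min(penalty, cap)


CAP = 500.0  # assumed monthly cap

for minutes in (25, 50, 100, 200):
    capped = monthly_penalty(minutes, cap=CAP)
    uncapped = monthly_penalty(minutes)
    print(f"{minutes:>3} min violated: capped = {capped:7.2f}, "
          f"uncapped = {uncapped:7.2f}")
```

Once 50 minutes of violation have accumulated, every further minute costs the supplier nothing under the capped scheme, so the clause no longer gives any financial reason to restore service quickly for the rest of the month.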
The service level indicators and objectives described in the SLA are then used by the operations staff and by automated systems to manage the service levels, as described in Chapters 6 and 7.