Summary
Many businesses implement highly available clusters when the expected cost of downtime exceeds the cost of the cluster. Both costs can be quantified in business terms, and downtime can be measured by well-designed systems.
We examined how failures occur in complex systems and showed methods that contain, isolate, report, and repair failures. Synchronization is used by any component or system that creates copies of data, including redundant storage, caches, and cluster components. Arbitration is important in clustered systems for deciding which of the many components form the appropriate configuration to provide services.
Finally, we examined special considerations for clustered systems. Many systems use caches to improve performance. Caches introduce synchronization problems because they duplicate data. Timeouts are used to detect whether a component has failed, but they must be tuned carefully: a timeout that is too short reports failures that have not occurred, and one that is too long delays recovery.
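The sensitivity of timeout tuning can be illustrated with a minimal sketch. It assumes a hypothetical heartbeat scheme in which a monitored node is suspected of failure whenever the gap between two consecutive heartbeats exceeds the timeout. The function name, the arrival times, and the timeout values are invented for illustration and are not taken from any particular cluster product.

```python
def false_suspicions(arrival_times, timeout):
    """Count gaps between consecutive heartbeats that exceed the timeout.

    The node in this example never actually fails, so every gap that
    exceeds the timeout is a false suspicion caused by the timeout value.
    """
    gaps = (cur - prev for prev, cur in zip(arrival_times, arrival_times[1:]))
    return sum(1 for gap in gaps if gap > timeout)


# Hypothetical heartbeat arrival times in seconds. The node is alive
# throughout, but one heartbeat is delayed (a gap of about 2.9 seconds).
arrivals = [0.0, 1.0, 2.1, 5.0, 6.0]

aggressive = false_suspicions(arrivals, timeout=2.0)    # suspects the slow node
conservative = false_suspicions(arrivals, timeout=3.0)  # tolerates the delay
```

With the shorter timeout the delayed heartbeat is treated as a failure even though the node is healthy. The longer timeout avoids that, at the price of taking longer to notice a real failure.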
A number of failure modes are specific to clusters: split brain, amnesia, and multiple instances. Clusters use special techniques to help prevent such failures.
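One widely used technique against split brain is majority voting, often called quorum. It is shown here only as an illustration and is not necessarily the mechanism a given cluster uses. In the minimal sketch below, a partition may provide services only if it holds more than half of all configured votes. The function name and the vote counts are hypothetical.

```python
def has_quorum(votes_held, total_votes):
    """Return True if a partition holds a strict majority of all votes."""
    return votes_held > total_votes // 2


# A four-node cluster (one vote per node) splits into two halves:
# neither half has a majority, so neither may continue on its own.
even_split = has_quorum(2, 4)

# Adding one tie-breaking vote (five votes in total) lets exactly one
# side win: the side that also holds the extra vote.
side_with_tiebreaker = has_quorum(3, 5)
side_without_tiebreaker = has_quorum(2, 5)
```

Because two disjoint partitions cannot both hold a strict majority, at most one of them continues to provide services.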
Author's Bio: Richard Elling
Richard Elling is the Chief Architect for Enterprise Engineering at Sun Microsystems in San Diego, California. Richard had been a field systems engineer at Sun for five years. He was the Sun Field Systems Engineer of the Year in 1996. Prior to Sun, he was the Manager of Network Support for the College of Engineering at Auburn University, a design center for a startup microelectronics company, and worked for NASA doing electronic design and experiments integration for Space Shuttle missions.
Author's Bio: Tim Read
Tim Read is a Lead Consultant for the High End Systems Group in Sun UK's Joint Technology Organization. Since 1985 he has worked in the UK computer industry, joining Sun in 1990. He holds a BSc in Physics with Astrophysics from Birmingham University. As part of his undergraduate studies Tim studied clusters of Suns; now he teaches and writes about Sun clusters.