Summary
Many businesses implement highly available clusters when the expected cost of downtime exceeds the cost of the cluster. This trade-off can be quantified in business terms, and availability can be measured in well-designed systems.
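As a rough illustration of the cost comparison above, the following sketch estimates the yearly cost of downtime from an availability figure and checks whether the savings from a cluster exceed what the cluster costs. The function names, the availability figures, and the dollar amounts are hypothetical values chosen for the example, and the model assumes a constant cost per hour of outage.

```python
HOURS_PER_YEAR = 8760.0  # 365 days x 24 hours; assumption: non-leap year


def expected_annual_downtime_cost(availability, cost_per_hour):
    """Expected yearly outage cost for a system with the given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    return downtime_hours * cost_per_hour


def cluster_is_justified(single_availability, cluster_availability,
                         cost_per_hour, annual_cluster_cost):
    """True when the downtime cost avoided by clustering exceeds the cluster's yearly cost."""
    saved = (expected_annual_downtime_cost(single_availability, cost_per_hour)
             - expected_annual_downtime_cost(cluster_availability, cost_per_hour))
    return saved > annual_cluster_cost


if __name__ == "__main__":
    # Hypothetical figures: 99% vs. 99.9% availability, $1,000 per outage hour.
    print(cluster_is_justified(0.99, 0.999, 1000.0, 50000.0))
```

Under these assumed figures the cluster avoids roughly 79 hours of downtime per year, so it pays for itself at an annual cost of $50,000 but not at $100,000.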
We examined how failures occur in complex systems and showed methods that contain, isolate, report, and repair failures. Synchronization is used by any component or system that creates copies of data, including redundant storage, caches, and cluster components. Arbitration is important in clustered systems for deciding which set of components forms the appropriate configuration to provide services.
Finally, we examined special considerations for clustered systems. Many systems use caches to improve performance. Caches introduce synchronization problems because they hold duplicate copies of data. Timeouts are used to determine whether a component has failed, but they must be tuned carefully: a timeout that is too short declares a slow component failed, and one that is too long delays recovery.
A number of failure modes are specific to clusters: split brain, amnesia, and multiple instances. Clusters use special techniques to help prevent such failures.
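One widely used arbitration technique for avoiding split brain is a majority-vote quorum, in which a partition may provide services only if it holds more than half of the configured votes. The sketch below is a minimal illustration under that assumption and does not describe any particular product's implementation. The function name and the three-vote configuration (two nodes plus a tie-breaking quorum device) are hypothetical.

```python
def has_quorum(votes_present, total_votes):
    """Return True if a partition holds a strict majority of the configured votes."""
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    if not 0 <= votes_present <= total_votes:
        raise ValueError("votes_present must be between 0 and total_votes")
    # Strict majority: two disjoint partitions can never both satisfy this.
    return 2 * votes_present > total_votes


if __name__ == "__main__":
    # Hypothetical configuration: two nodes (one vote each) plus one quorum
    # device (one vote). The interconnect fails and the cluster partitions.
    total = 3
    partition_a = 2  # node A plus the quorum device it managed to reserve
    partition_b = 1  # node B alone
    print(has_quorum(partition_a, total), has_quorum(partition_b, total))
```

Because the test is a strict majority, a two-node cluster without a tie-breaking vote would leave both halves of a partition without quorum, which is why the example assumes a third vote.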