Ubuntu Server Fault Tolerance
Hardware fails. Over the years I have had basically every major hardware component on a server fail, from CPUs to RAM to SCSI controllers and, of course, hard drives. In addition to hardware failure, system downtime is often the result of some other problem such as a bad configuration on a switch, a power outage, or even a sysadmin accidentally rebooting the wrong server. If you lose money whenever a service is down, you quickly come up with methods to keep that service up no matter what component fails.
In this chapter I will discuss some of the methods you can use with Ubuntu servers to make them more fault-tolerant. I will start with some general fault tolerance principles. Then I will talk about ways to add fault tolerance to your storage and network with RAID and Ethernet bonding, respectively. Of course, even with those procedures in place your server could crash or reboot, or you could lose a CPU, so finally I will talk about how to set up a basic two-server cluster.
Fault Tolerance Principles
- Build redundant systems.
- Favor hot-swappable components.
- Test your redundancy.
- Eliminate any single points of failure.
- Respond to failures quickly.
The basic idea behind fault tolerance is to set up your systems so that you can lose any one component without an outage. These days servers with redundant power supplies and redundant disks are common. There are even servers that have redundant BIOSes and remote management ports. The downside of redundancy is that it is often wasteful. For instance, with RAID you typically give up at least one disk's worth of storage to redundancy. Compared to the cost of downtime, though, most sysadmins consider that storage well spent.
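As a rough sketch of that tradeoff, here is how you might create a three-disk software RAID 5 array with mdadm on Ubuntu; the device names (/dev/md0, /dev/sdb1, and so on) are examples you would replace with your own:

# RAID 5 gives you (N - 1) disks' worth of usable space, so three
# 1TB partitions yield roughly 2TB of storage plus fault tolerance
$ sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 \
  /dev/sdb1 /dev/sdc1 /dev/sdd1

# Watch the array assemble and sync
$ cat /proc/mdstat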
RAID is great because it keeps a disk failure from costing you data or taking the host down, but if you have to power down the host to replace the failed drive, you lose much of the uptime benefit. Where possible, favor components that are hot-swappable. These days servers are likely to offer at least hot-swappable drives and power supplies, and many have hot-swappable fans as well. In some higher-end blade servers you can even hot-swap integrated network and SAN switches and remote management cards.
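To make that concrete, here is a sketch of swapping a failed member out of a Linux software RAID array without a reboot, assuming the example array and device names from above and a hot-swap drive bay:

# If the kernel hasn't already flagged the disk, mark it failed first
$ sudo mdadm --manage /dev/md0 --fail /dev/sdb1

# Remove the failed disk from the array, then physically swap the drive
$ sudo mdadm --manage /dev/md0 --remove /dev/sdb1

# Add the replacement; the array rebuilds onto it automatically
$ sudo mdadm --manage /dev/md0 --add /dev/sdb1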
As with backups, if you haven't tested your fault tolerance, then you don't have fault tolerance. If possible, before you deploy a new redundant system such as Ethernet bonding or server clustering, be sure to simulate failures so that you understand both how the system responds to a fault and how it behaves once the fault has been repaired. Depending on how you configure them, systems can behave very differently on both counts. This testing phase is also a good time to confirm that any monitoring you have put in place actually detects these failures.
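As an example of that kind of testing, both software RAID and Ethernet bonding let you simulate a fault safely; again, the device names here (/dev/md0, /dev/sdb1, eth0, bond0) are placeholders:

# Force a RAID member into a failed state and watch the array degrade
$ sudo mdadm --manage /dev/md0 --fail /dev/sdb1
$ cat /proc/mdstat

# Drop one slave of a bonded interface and confirm traffic fails over
$ sudo ip link set eth0 down
$ cat /proc/net/bonding/bond0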
While having some redundancy is better than having none, try to go through the entire server stack and identify and eliminate any single points of failure. For instance, redundant power sources for your data center do you little good if every server that draws on them connects to a single switch with a single power supply. For larger operations, even the data center itself is seen as a single point of failure, so in those cases servers are distributed across multiple data centers in entirely different geographic locations.
When a component fails, try to identify and repair the problem as soon as you can. In RAID, for instance, many sysadmins set up a disk as a hot spare so that the moment a disk fails, a replacement can take its place. Provided the hot spare syncs before another drive fails, the data will still be intact. While you can’t do this with every component, when you do have a fault, try to repair it before you lose the fail-over side as well.
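With mdadm, for instance, you could designate a hot spare when you create the array, or add one to a healthy array later; /dev/sde1 below is a placeholder for the spare disk:

# Create a RAID 5 array with one hot spare that syncs in automatically
# the moment a member disk fails
$ sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 \
  --spare-devices=1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

# On an array that isn't degraded, --add attaches the disk as a spare
$ sudo mdadm --manage /dev/md0 --add /dev/sde1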