High-Availability Servers
While other hardware is important, the heart of any high-availability system is the servers. Broadly speaking, there are three major approaches to making servers highly available:
- Standby systems
- Clusters with automatic failover
- Purpose-built high-availability servers
Let’s look at the differences between these options.
Standby Systems
Perhaps the cheapest method of providing high availability is to have a second server ready to take over if the main server fails. There are three ways to go:
- Cold spare, waiting to be booted when needed
- Hot spare, already running but not carrying any of the load
- Separate server, carrying only part of the load
That last option is especially common in web servers and other systems that lend themselves naturally to load balancing.
Standby systems are simple—or about as simple as anything gets in the world of high availability—and adequate for a lot of companies’ needs.
The disadvantages of standby systems are time and human intervention. Unlike a cluster (discussed shortly), a standby system has only limited automatic failover capability. While servers that are already sharing the work in a load-balancing scheme can fail over automatically, hot spares and cold spares generally can’t. Typically someone has to switch over to the spare, which requires available staff and takes time. Depending on the design, that delay can lead to data loss. Still, the simplicity of standbys, along with their lower cost compared to a cluster, makes them a popular approach.
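To make the load-balancing case concrete, here is a minimal Python sketch of why servers that already share the work can fail over without an operator: the balancer simply stops handing requests to a server that has been marked down. The class name, method names, and server names are hypothetical and chosen for illustration only; a real load balancer would also need health checks to decide when a server is down.

```python
class LoadBalancedPool:
    """Round-robin pool that skips servers marked as down (illustrative only)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server):
        # In practice a health check would call this; here it is manual.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once, starting from the round-robin position.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers left in the pool")
```

With two servers in the pool, requests alternate between them. Once one is marked down, every request goes to the survivor, which now carries the full load. A hot or cold spare has no equivalent mechanism, because nothing is routing traffic to it in the first place.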
Of course, in dealing with hot IT products, categories are seldom hard and fast. In an attempt to have the best of all worlds, some vendors have introduced products that are less than clusters, but more than conventional standby servers. One such product is EverRun from Marathon Technologies. EverRun runs on a matched set of Windows systems, either single- or dual-processor, that run the same application simultaneously and use a virtualization layer to look like a single server to the rest of the system. Because the systems are so closely tied, Marathon promises no data loss and near-immediate switchover.
Clusters with Automatic Failover
This option is what Windows administrators instinctively think of when someone says "high availability." Products such as Microsoft’s Windows Cluster Server and similar offerings from EMC Legato Software, Symantec’s Veritas, and others provide high availability by tightly linking several computers, called "nodes," into a cluster. The nodes monitor each other through an out-of-band channel (a separate network path); when one of the nodes falters, its partners in the cluster recognize the situation and pick up the load. The time needed for the other nodes to recognize a failure and respond to it sets a minimum downtime for each failure.
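The mutual monitoring described above is commonly built on heartbeats: each node periodically announces that it is alive, and a node that stays silent for too long is presumed failed. The Python sketch below illustrates the idea under that assumption. It is not how any of the products named here is implemented, and the class name, the timeout value, and the node names are all hypothetical.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat from each node and flags silent ones (illustrative only)."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence tolerated before a node is presumed failed
        self.last_seen = {}     # node name -> time of its most recent heartbeat

    def record(self, node, now):
        # Called whenever a heartbeat arrives over the out-of-band channel.
        self.last_seen[node] = now

    def failed_nodes(self, now):
        # A node counts as failed once it has been silent for longer than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

With a 3-second timeout, a node that last reported at time 0 is still considered healthy at time 3 and is flagged only after that. This is the minimum downtime the text refers to: a shorter timeout detects failures sooner but risks declaring a slow node dead, and the surviving nodes still need additional time to pick up the load.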
Clusters are more expensive than standby systems; they’re more tightly integrated and hence more complex. While none of the high-availability solutions discussed here is exactly "simple," clusters have the reputation of requiring a lot more skill to set up than the other options. They involve more moving parts than standby servers, and they don’t have the kind of built-in simplifications that come with using specially designed hardware for high-availability servers. Fundamentally, a cluster is not a product; it’s a package of hardware and software products that have to work together. Someone has to integrate all that stuff and make sure that it’s working properly.
In a lot of ways, clusters are about where storage area networks were not long ago. They’ve definitely arrived as products, but they’re still short on ease of installation. Because of this drawback, Microsoft has instituted the Datacenter High Availability Program for Windows Server 2003. You can think of it as analogous to the SAN-in-a-box solution that a number of vendors introduced a few years ago. The Datacenter High Availability Program allows customers to buy a Microsoft cluster installation in a pretested, preconfigured hardware and software package from a certified vendor such as IBM or Unisys, or through a certified VAR, and get extra support from Microsoft as well.
Clusters work best with cluster-aware software, which is written with the special characteristics of clusters (failover behavior, for example) in mind. Many major applications, from ERP to email, are available in cluster-aware versions. In addition, a lot of regular software works just fine on a cluster.
High-Availability Servers
While clusters are really software products running (usually) on top of standard hardware, companies such as Stratus are designing and building Windows servers to be highly reliable. Each server typically has redundant everything, including processors running simultaneously. (Stratus even has the option to add a third processor board to its servers for further redundancy.)
These high-availability servers have a high degree of built-in redundancy and special software and hardware to track, correct, and report errors. For example, the products typically have elaborate monitoring software to check everything from disk errors to processor temperatures. Stratus also has a program called ActiveService, in which the server automatically notifies Stratus in the event of a problem so Stratus technicians can fix it remotely. The server can even be set to order replacement parts automatically.
High-availability servers generally have the highest up-front costs of any option, although some cluster configurations can rival them in cost. Properly configured and installed, such servers can offer extremely high availability with minimal chance of data loss.