A Look at Failures
When we use multiple machines, each with its own disk drives, network interconnects, processors, and memory, the likelihood of failure becomes a significant concern. Consider hard disk failures. If a disk fails on average once in 1000 days, the probability of it failing on any given day is 1/1000, which is not a major concern on its own. However, with 1000 such disks, the probability that at least one disk fails on a given day rises to 1 − (1 − 1/1000)^1000, roughly 63%, which works out to about one disk failure per day on average. If partitioned data is being served from the disk that fails, that data becomes unavailable until the disk is recovered.
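To make the arithmetic concrete, here is a minimal sketch in Python; the function name is ours, and it assumes disks fail independently of one another:

```python
# Probability that at least one of n independent disks fails on a given day,
# given a per-disk daily failure probability p.
def prob_any_failure(n: int, p: float) -> float:
    return 1 - (1 - p) ** n

# 1000 disks, each failing on average once in 1000 days.
print(prob_any_failure(1000, 1 / 1000))  # ~0.632: a failure on most days
print(1000 * (1 / 1000))                 # expected failures per day: 1.0
```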
To get a sense of the types of failures that can occur, look at the failure statistics from Jeff Dean’s 2009 talk [Dean2009] on Google’s data centers, shown in Table 1.1. Although these numbers are from 2009, they still provide a valuable picture of typical failure patterns.
Table 1.1 Failure Events per Year for a Cluster in a Data Center from Jeff Dean’s 2009 Talk [Dean2009]
| Event | Details |
|---|---|
| Overheating | Power down most machines in < 5 min (~1–2 days to recover) |
| PDU Failure | ~500–1000 machines suddenly disappear (~6 hours to come back) |
| Rack Move | Plenty of warning, ~500–1000 machines powered down (~6 hours) |
| Network Rewiring | Rolling ~5% of machines down over 2-day span |
| Rack Failures | 40–80 machines instantly disappear (1–6 hours to get back) |
| Racks Go Wonky | 40–80 machines see 50% packet loss |
| Network Maintenances | 4 might cause ~30-minute random connectivity losses |
| Router Reloads | Takes out DNS and external VIPs for a couple minutes |
| Router Failures | Have to immediately pull traffic for an hour |
| Minor DNS Blips | Dozens of 30-second blips for DNS |
| Individual Machine Failures | 1000 individual machine failures |
| Hard Drive Failures | Thousands of hard drive failures |
When distributing stateless compute across multiple servers, failures can be managed relatively easily. If a server responsible for handling user requests fails, the requests can be redirected to another server, or a new server can be added to take over the workload. Because stateless compute does not depend on data stored on a particular server, any server can begin serving any user's requests without loading state beforehand.
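A minimal sketch of this kind of client-side failover, assuming hypothetical server addresses and a hypothetical `handle_request` helper, might look like this:

```python
import urllib.error
import urllib.request

# Hypothetical pool of stateless application servers; any of them can handle
# any request because no per-user state lives on a particular server.
SERVERS = [
    "http://app1.example.com",
    "http://app2.example.com",
    "http://app3.example.com",
]

def handle_request(path: str) -> bytes:
    """Try each server in turn, skipping any that fail to respond."""
    last_error = None
    for server in SERVERS:
        try:
            with urllib.request.urlopen(server + path, timeout=2) as response:
                return response.read()
        except (urllib.error.URLError, OSError) as error:
            last_error = error  # server is down or unreachable; try the next one
    raise RuntimeError("all servers are unavailable") from last_error
```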
Failures become particularly challenging when dealing with data. Bringing up a replacement instance on another server is not as straightforward: the new server must start in the correct state and coordinate with other nodes so that it does not serve incorrect or stale data. This book mainly focuses on systems that face these kinds of challenges.
To keep the system functional when some of its components fail, it is not enough to simply distribute data across cluster nodes; the failures must also be effectively masked.