This chapter is from the book

A Look at Failures

When we use multiple machines, each with its own disk drives, network interconnects, processors, and memory, the likelihood of failures becomes a significant concern. Consider hard disk failures. If a disk fails on average once in 1000 days, the probability of it failing on any given day is 1/1000, which may not be a major concern on its own. With 1000 such disks, however, the expected number of disk failures per day rises to one, and the probability of at least one disk failing on a given day is about 63% (1 − (1 − 1/1000)^1000). If partitioned data is being served from the disk that fails, it becomes unavailable until the disk is recovered.
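This arithmetic is easy to verify. Assuming independent failures with per-disk daily failure probability p, the probability that at least one of n disks fails is 1 − (1 − p)^n, while the expected number of failures is n × p. A minimal, purely illustrative check in Python:

```python
# Per-disk daily failure probability and number of disks, as in the
# example above (failure once in 1000 days, 1000 disks).
p = 1 / 1000
n = 1000

# Probability that at least one disk fails on a given day.
p_at_least_one = 1 - (1 - p) ** n

# Expected number of disk failures per day.
expected_failures = n * p

print(f"P(at least one failure) = {p_at_least_one:.3f}")  # ~0.632
print(f"Expected failures/day   = {expected_failures:.1f}")  # 1.0
```

Note the distinction: the expected count is exactly one failure per day, but a failure on any particular day is likely, not certain.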

To gain insight into the types of failures that can occur, consider the failure statistics from Jeff Dean’s 2009 talk [Dean2009] on Google’s data centers, shown in Table 1.1. Although these numbers are from 2009, they still give a representative picture of failure patterns.

Table 1.1 Failure Events per Year for a Cluster in a Data Center from Jeff Dean’s 2009 Talk [Dean2009]

Overheating: Power down most machines in < 5 min (~1–2 days to recover)

PDU Failure: ~500–1000 machines suddenly disappear (~6 hours to come back)

Rack Move: Plenty of warning; ~500–1000 machines powered down (~6 hours)

Network Rewiring: Rolling ~5% of machines down over a 2-day span

Rack Failures: 40–80 machines instantly disappear (1–6 hours to get back)

Racks Go Wonky: 40–80 machines see 50% packet loss

Network Maintenances: 4 might cause ~30-minute random connectivity losses

Router Reloads: Takes out DNS and external VIPs for a couple of minutes

Router Failures: Have to immediately pull traffic for an hour

Minor DNS Blips: Dozens of 30-second blips for DNS

Individual Machine Failures: ~1000 individual machine failures

Hard Drive Failures: Thousands of hard drive failures

When distributing stateless compute across multiple servers, failures can be managed relatively easily. If a server responsible for handling user requests fails, the requests can be redirected to another server, or a new server can be added to take over the workload. Because stateless compute does not depend on data stored on any particular server, any server can begin serving any user’s requests without loading state beforehand.
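The redirection described above can be sketched as a simple client-side failover loop. This is an illustrative Python sketch, not code from the book; the server names and the `handle` function are stand-ins for a real RPC mechanism:

```python
import random

servers = ["server-1", "server-2", "server-3"]
failed = {"server-2"}  # pretend this server has crashed

def handle(server, request):
    """Stand-in for a network call; raises if the server is down."""
    if server in failed:
        raise ConnectionError(f"{server} unreachable")
    return f"{server} handled {request}"

def serve(request):
    # Try servers in a random order. Because the servers are
    # stateless, any healthy one can answer any request.
    for server in random.sample(servers, len(servers)):
        try:
            return handle(server, request)
        except ConnectionError:
            continue  # just move on to another server
    raise RuntimeError("all servers down")

print(serve("GET /home"))
```

The key point is that the failover loop needs no coordination: no server holds state the others lack, so redirecting is always safe.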

Failures become particularly challenging when dealing with data. Creating a separate instance on a random server is not as straightforward. It requires careful consideration to ensure that the servers start in the correct state and coordinate with other nodes to avoid serving incorrect or stale data. This book mainly focuses on systems that face these types of challenges.

To ensure that the system remains functional even if certain components are experiencing failures, simply distributing data across cluster nodes is often insufficient. It is crucial to effectively mask the failures.
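One common way to mask a failure is to replicate each piece of data on several nodes rather than merely partitioning it. The following sketch is illustrative only (the `Replica` class and node names are invented for this example, not an API from the book):

```python
class Replica:
    """A toy in-memory replica of a key-value store."""
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def put(self, key, value):
        if self.alive:
            self.data[key] = value

    def get(self, key):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]

replicas = [Replica("node-a"), Replica("node-b"), Replica("node-c")]

# A write goes to every replica.
for r in replicas:
    r.put("user:42", "alice")

# One node fails; with partitioning alone its data would be lost
# until recovery, but replication masks the failure.
replicas[0].alive = False

def read(key):
    for r in replicas:
        try:
            return r.get(key)
        except ConnectionError:
            continue  # fall through to the next replica
    raise RuntimeError("no replica available")

print(read("user:42"))  # still served despite node-a's failure
```

Of course, real systems must also keep the replicas consistent with one another, which is exactly the coordination challenge the previous paragraph alludes to.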
