The Last Word in RAID, Too?
One of the most exciting features of ZFS is RAID-Z. A modern hard disk is a device with a fairly boring interface: an array of fixed-size blocks that can be read from or written to. Since RAID is typically implemented close to the block layer (often in hardware, transparently to the operating system), RAID devices expose the same interface. In a RAID-5 array with three disks, writing a block involves storing the block on disk 1 and storing the parity, the XOR of that block with the corresponding block on disk 2, on disk 3. This plan has two associated problems:
- If you’re lucky, you can guarantee atomic writes on a single disk, but it’s almost impossible to get atomic writes across a group of disks. If something fails between writing the data block and writing the parity, the two no longer agree, and any later reconstruction of that stripe will produce nonsense (the classic RAID-5 “write hole”). Modern RAID controllers work around this problem by storing writes in nonvolatile RAM until they receive confirmation from the disks that the data is safely stored.
- In the above scenario, writing one block to disk 1 requires you to then read the corresponding block from disk 2 and store the recomputed parity on disk 3. This extra read operation in the middle of every write is expensive, as the sketch below illustrates.
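To make both problems concrete, here is a minimal toy model of a three-disk RAID-5 write, with disks as in-memory byte arrays. `BLOCK`, `xor_blocks`, and `raid5_write_block` are hypothetical names for illustration, not any real driver's API:

```python
BLOCK = 512  # hypothetical block size for this toy model

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_write_block(disks, index, data):
    """Write one block to disk 0; disk 2 holds the parity of disks 0 and 1."""
    off = index * BLOCK
    peer = bytes(disks[1][off:off + BLOCK])  # the extra read, on every write
    disks[0][off:off + BLOCK] = data
    # A crash here leaves data and parity disagreeing: the RAID-5 write hole.
    disks[2][off:off + BLOCK] = xor_blocks(data, peer)

disks = [bytearray(BLOCK * 8) for _ in range(3)]
raid5_write_block(disks, 3, b"\xab" * BLOCK)
```

The window between the two writes is exactly where the first problem lives, and the read of `disks[1]` is the second.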
So what does RAID-Z do differently? For one thing, a RAID-Z array isn’t quite as stupid as a RAID array; it has some awareness of what’s stored on it. The key ingredient is variable stripe width. With existing RAID implementations, stripes are either 1 byte wide (for example, every odd byte goes to disk 1, every even byte to disk 2, and the parity byte to disk 3) or one block wide. In ZFS, the stripe size is determined by the size of the write: every time you write to the disk, you write a complete stripe.
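A minimal sketch of what a variable-width full-stripe write looks like, assuming equal-sized blocks padded by the caller; `raidz_full_stripe` is an illustrative name, not a real ZFS function:

```python
from functools import reduce

def raidz_full_stripe(data_blocks: list[bytes]) -> list[bytes]:
    """Return every block this write touches: the data plus fresh parity.
    Blocks are assumed to be equal-sized (the caller pads the tail)."""
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data_blocks)
    return data_blocks + [parity]  # the allocator spreads these over the disks

# A two-block write becomes a three-block stripe; a five-block write, six.
stripe = raidz_full_stripe([b"\x01" * 512, b"\x02" * 512])
```

Note that the parity is computed entirely from the data being written; nothing already on disk participates.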
This design eliminates both of the problems mentioned above. Since ZFS is transactional, a stripe is either written correctly and the metadata updated, or it isn’t. If the hardware fails mid-write, then that write will fail, but existing data on the disk won’t be affected. Similarly, because the stripe contains only the data being written, you never need to read anything back from the disk to perform a write.
RAID-Z is only possible because of the new layering structure in ZFS. When a disk fails, you can repair a RAID-5 volume by walking every stripe and asking, “XOR-ing all of the bits at index 0 on each disk together gives 0, so what must our missing disk have contained?” With RAID-Z, where stripe boundaries vary, this is impossible; instead, you have to traverse the filesystem metadata to find the stripes. A RAID controller that presents itself as a plain block device has no access to that metadata, so it couldn’t do this. One added bonus is that a hardware RAID controller has to reconstruct the entire disk, including blocks that have never been used, while a RAID-Z set only needs to reconstruct blocks that are actually in use, as the sketch below shows.
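Here is a toy model of that metadata-driven repair. `BlockPointer`, `walk_tree`, and `resilver` are illustrative stand-ins, not real ZFS interfaces; the point is that only blocks reachable from the metadata tree get rebuilt, so free space is never touched:

```python
from dataclasses import dataclass, field
from functools import reduce

@dataclass
class BlockPointer:
    columns: dict                 # disk index -> this stripe's block on it
    children: list = field(default_factory=list)

def walk_tree(bp):
    """Yield every block pointer reachable from the root: only live data."""
    yield bp
    for child in bp.children:
        yield from walk_tree(child)

def xor_all(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def resilver(root: BlockPointer, failed_disk: int):
    for bp in walk_tree(root):
        if failed_disk in bp.columns:          # this stripe touched the disk
            survivors = [v for d, v in bp.columns.items() if d != failed_disk]
            bp.columns[failed_disk] = xor_all(survivors)  # rebuild from parity
```

A block-level controller would have no choice but to loop over every sector of the failed disk instead.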
While it isn’t part of RAID-Z proper, ZFS includes one more feature that helps eliminate data corruption: every block is covered by a SHA256 hash, so a damaged sector on disk shows up as an error even if the disk controller doesn’t notice it. This is an advantage over existing RAID implementations. With RAID-5, for example, you can rebuild an entire volume, but if a single sector silently breaks, all the array can tell you is that the parity no longer matches; it has no way of knowing which disk holds the bad data. A RAID-Z volume can tell you which disk contains the error (the one whose block doesn’t match the hash) and can therefore reconstruct the data from the others. This also serves as an early warning that a disk is likely to fail.
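A sketch of how a per-block hash pinpoints the lying disk: reconstruct the block assuming each column in turn is wrong, and keep the combination whose SHA256 matches the stored checksum. `repair` is an illustrative name, not ZFS code:

```python
import hashlib
from functools import reduce

def xor_all(blocks):
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def repair(columns: list, parity: bytes, expected_sha256: bytes):
    """Return (bad_disk, repaired_columns), or None if the data checks out."""
    if hashlib.sha256(b"".join(columns)).digest() == expected_sha256:
        return None
    for bad in range(len(columns)):
        others = [c for i, c in enumerate(columns) if i != bad]
        candidate = xor_all(others + [parity])   # rebuild the suspect column
        fixed = columns[:bad] + [candidate] + columns[bad + 1:]
        if hashlib.sha256(b"".join(fixed)).digest() == expected_sha256:
            return bad, fixed                    # found the bad disk
    raise IOError("more than one column is damaged; cannot reconstruct")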
With all this talk of variable stripe sizes, you may wonder what happens when a write is smaller than an allocation unit. The answer is simple: instead of computing parity, ZFS simply mirrors the data, as in the sketch below.
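A minimal sketch of that dispatch, assuming a hypothetical allocator; `layout_write` is an illustrative name, not the real ZFS code path:

```python
from functools import reduce

def layout_write(data: bytes, alloc_unit: int) -> list:
    """Choose a layout for one write (a sketch, not the real ZFS allocator)."""
    if len(data) <= alloc_unit:
        return [data, data]         # too small to stripe: store two copies
    blocks = [data[i:i + alloc_unit] for i in range(0, len(data), alloc_unit)]
    blocks[-1] = blocks[-1].ljust(alloc_unit, b"\x00")  # pad the tail block
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)
    return blocks + [parity]
```

For a write of a single allocation unit, a mirror costs the same space as data plus parity would, and it avoids the degenerate one-block “stripe.”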
One thing I’ve found particularly interesting about ZFS is that it should perform much better on a block device with a low cost for random reads. It’s almost as if the designers had flash, rather than hard disks, in mind.