The Object Layer
The middle layer of ZFS is the transactional object layer. The core of this layer is the Data Management Unit (DMU), and on many ZFS block diagrams the DMU is all that you see of this layer. The DMU exposes objects to the top layer and allows atomic operations to be performed on them.
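In code, a consumer of the DMU wraps its writes in a transaction. The sketch below loosely paraphrases the pattern used by ZFS's own write path in the OpenSolaris source; the function names are real DMU entry points, but the exact signatures are from memory and should be treated as approximate, and the snippet assumes kernel context.

/*
 * Sketch of a DMU consumer (kernel context; signatures approximate).
 */
#include <sys/dmu.h>

int
update_object(objset_t *os, uint64_t object, uint64_t off, int len,
    const void *buf)
{
	dmu_tx_t *tx = dmu_tx_create(os);
	int err;

	/* Declare up front what this transaction intends to modify. */
	dmu_tx_hold_write(tx, object, off, len);

	/* Join a transaction group; nothing has touched the disk yet. */
	err = dmu_tx_assign(tx, TXG_WAIT);
	if (err != 0) {
		dmu_tx_abort(tx);
		return (err);
	}

	/* Either all of this write reaches the disk, or none of it does. */
	dmu_write(os, object, off, len, buf, tx);
	dmu_tx_commit(tx);
	return (0);
}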
If you’ve ever had a power failure while writing a file, you probably ran fsck, ScanDisk, or some equivalent afterward. At the end, you likely ended up with a few corrupted files. If they were text files, you might have been lucky; the corruption could be undone quite easily. If the files had a complex structure, on the other hand, you might have lost the entire file. Database applications avoid this problem by using a transactional mechanism: they write a record in a log saying "I’m about to do this," then they do it, and then they log "I’ve done this." If something goes wrong in the middle, the database can simply roll back to the state before the transaction started.
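As a toy illustration of that discipline (hypothetical code, not any real database's API), the intent record goes to the log before the change and the completion record after it:

/*
 * Toy write-ahead log (hypothetical).  Recovery rolls back any change
 * whose BEGIN record has no matching COMMIT record.
 */
#include <stdio.h>

static void log_record(const char *msg)
{
	FILE *log = fopen("journal.log", "a");
	if (log == NULL)
		return;
	fprintf(log, "%s\n", msg);
	fflush(log);	/* flush the record (a real database would fsync here) */
	fclose(log);
}

int main(void)
{
	log_record("BEGIN: update record 42");
	/* ... modify the data itself here ... */
	log_record("COMMIT: update record 42");
	return 0;
}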
A lot of recent filesystems use journaling, which applies the same idea at the filesystem level. The advantage of journaling is that the filesystem’s on-disk structure is always consistent; after a power failure, you just need to replay the journal, not scan the entire disk. Unfortunately, this consistency doesn’t extend to file contents. If you issue two writes from a userspace application, it’s possible that one will complete while the other won’t, leaving the application’s data in an inconsistent state.
ZFS uses a transactional model. You can begin a transaction, issue a number of writes, and either all of them will take effect or none will. This is possible because ZFS uses a copy-on-write mechanism. Every time you write some data, ZFS writes it to an unused part of the disk, then updates the metadata to say "This is the new version." If the write process doesn’t reach the stage of updating the metadata, none of the old data has been overwritten.
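The ordering is the whole trick. Here is a minimal, self-contained model of it (hypothetical code, not ZFS internals): new data lands in a spare slot first, and a single pointer update is what makes it the current version.

/*
 * Minimal copy-on-write model (hypothetical).  A crash at any point
 * before the final pointer update leaves the old version intact.
 */
#include <stdio.h>

#define SLOTS 4

static char disk[SLOTS][32];	/* pretend disk blocks */
static int  current = 0;	/* "metadata": which slot is live */

static void cow_write(const char *data)
{
	int spare = (current + 1) % SLOTS;	/* a slot that isn't live */

	snprintf(disk[spare], sizeof disk[spare], "%s", data);
	/* A crash before the next line loses nothing that was committed. */
	current = spare;	/* the atomic metadata update */
}

int main(void)
{
	cow_write("version 1");
	cow_write("version 2");
	printf("live data: %s\n", disk[current]);
	printf("previous version, untouched: %s\n",
	    disk[(current + SLOTS - 1) % SLOTS]);
	return 0;
}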
One side effect of copy-on-write is that it allows for constant-time snapshots. Some filesystems, such as UFS2 on FreeBSD and XFS on IRIX, already support snapshots, so the feature isn’t conceptually new. The standard technique is to set aside a snapshot partition; once you’ve created a snapshot, every write operation is replaced by a sequence that copies the original block to the snapshot partition and then performs the write, as sketched below. Needless to say, this approach is quite expensive.
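In outline (hypothetical model code, not any particular filesystem's implementation), the first write to each block pays for a read and an extra write before the real write can proceed:

/*
 * Model of the traditional copy-before-write snapshot (hypothetical).
 * The first write to each block must first copy the old contents to
 * the snapshot partition, roughly tripling the I/O for that write.
 */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 8
#define BLKSZ   32

static char main_part[NBLOCKS][BLKSZ];	/* live data */
static char snap_part[NBLOCKS][BLKSZ];	/* snapshot partition */
static int  saved[NBLOCKS];		/* old contents copied yet? */

static void snapshotted_write(int block, const char *data)
{
	if (!saved[block]) {
		/* The expensive part: copy the old data aside first. */
		memcpy(snap_part[block], main_part[block], BLKSZ);
		saved[block] = 1;
	}
	snprintf(main_part[block], BLKSZ, "%s", data);
}

int main(void)
{
	snprintf(main_part[0], BLKSZ, "%s", "old contents");
	snapshotted_write(0, "new contents");
	printf("live:     %s\n", main_part[0]);
	printf("snapshot: %s\n", snap_part[0]);
	return 0;
}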
With ZFS, all you need to do to create a snapshot is increment a reference count on the filesystem’s current state. Every write operation is already nondestructive, so all that happens is that the metadata update doesn’t remove references to the old locations. Another side effect of this mechanism is that snapshots are first-class filesystems in their own right. A snapshot can be turned into a writable filesystem (ZFS calls this a clone) that behaves just like any other volume. For example, you can clone a filesystem for each user and allow each of them to do whatever they want with it, without affecting the other users. This feature is particularly useful in combination with Solaris Zones.
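The following sketch (again hypothetical model code, not ZFS internals) shows why the snapshot itself is constant time: it is nothing more than taking another reference on the current state, and copy-on-write ensures that later writes allocate fresh blocks rather than disturbing the old ones.

/*
 * Reference-counted snapshot model (hypothetical).  A block is freed
 * only when neither the live filesystem nor any snapshot refers to it.
 */
#include <stdio.h>
#include <stdlib.h>

struct block {
	char data[32];
	int  refs;
};

static struct block *block_new(const char *data)
{
	struct block *b = malloc(sizeof *b);

	snprintf(b->data, sizeof b->data, "%s", data);
	b->refs = 1;
	return b;
}

static void block_release(struct block *b)
{
	if (--b->refs == 0)
		free(b);
}

int main(void)
{
	struct block *live = block_new("original contents");

	/* Taking a snapshot: one reference-count increment, O(1). */
	struct block *snapshot = live;
	snapshot->refs++;

	/* A copy-on-write update replaces the live block... */
	struct block *updated = block_new("new contents");
	block_release(live);	/* ...the snapshot's reference keeps it alive */
	live = updated;

	printf("live:     %s\n", live->data);
	printf("snapshot: %s\n", snapshot->data);

	block_release(snapshot);
	block_release(live);
	return 0;
}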