Christian Lehnert — Linux, Hacking & Faith

ZFS on the Homelab - Checksums, Snapshots, and the End of Silent Bit Rot

Christian Lehnert · 2017-08-09 · ~8 min read

Here is a fact that should bother you more than it does. Your filesystem cannot tell when your disk hands it the wrong data. A drive returns a block, ext4 takes it, and if a bit flipped on the platter or in the cable or in the controller's cache, ext4 has no way to know. It passes the corruption straight to your application and reports success. The file is rotten and every layer above the disk insists it is fine. This is silent data corruption, bit rot, and on a multi-terabyte array running for years it is not a question of whether it happens but how much of it you have already accumulated without noticing.

ZFS was built by people who refused to accept that. It trusts nothing: not the disk, not the cable, not the controller, not even its own RAM more than it has to. ZFS on Linux 0.7 shipped in July, and on the Stretch box from my last post ZFS is one contrib package away, though Stretch itself packages the older 0.6.5 series, so the 0.7 features mentioned below need a newer build than the stock one. So this is the storage layer I now put under anything I actually care about keeping, and here is why, along with the legal asterisk you inherit by running it on Linux.

Your filesystem is lying to you, ZFS isn't

The core idea is end-to-end checksumming. Every block ZFS writes gets a checksum, and that checksum is stored not next to the block but in the block's parent in the tree, all the way up to a checksummed root. When ZFS reads a block, it verifies it against the checksum its parent recorded. If they disagree, ZFS knows, with certainty, that what came off the disk is not what it wrote.
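The parent-holds-the-checksum idea can be mimicked with ordinary files. This is a toy sketch, not how ZFS is implemented: the file names and the verify_block helper are made up for illustration, and sha256sum simply stands in for whichever checksum a pool is configured to use.

```shell
#!/bin/sh
# Toy model only: ZFS does this inside its block tree, not with files.
workdir=$(mktemp -d)
printf 'family photo bytes' > "$workdir/block"

# The "parent" records the child's checksum, stored apart from the block itself.
sha256sum < "$workdir/block" | cut -d' ' -f1 > "$workdir/parent"

# Hypothetical helper: recompute on read, compare with what the parent recorded.
verify_block() {
    actual=$(sha256sum < "$1" | cut -d' ' -f1)
    expected=$(cat "$2")
    [ "$actual" = "$expected" ]
}

verify_block "$workdir/block" "$workdir/parent" && echo "first read: checksum matches"

# Simulate bit rot: overwrite one byte behind the filesystem's back.
printf 'X' | dd of="$workdir/block" bs=1 count=1 conv=notrunc 2>/dev/null

verify_block "$workdir/block" "$workdir/parent" || echo "second read: checksum mismatch, corruption detected"
```

Because the checksum lives with the parent and not beside the data, a block that rots cannot also quietly corrupt the record that would expose it.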

On its own that turns silent corruption into loud, detectable corruption, which is already a massive upgrade. But if the pool has redundancy, a mirror or RAID-Z, ZFS does better than detect it. It pulls the good copy from the other device, hands your application the correct data, and rewrites the bad block in place. This is self-healing, and it runs proactively when you scrub: ZFS walks every block in the pool, verifies all of it, and repairs everything it can while the array is still otherwise healthy. You schedule a scrub, and silent rot becomes an email instead of a discovery you make years later when you open an old photo and find half of it is grey static.
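A scheduled scrub might look like the sketch below. It assumes a pool named tank, as in the rest of this post, and relies on zpool status -x printing a "healthy" line when nothing is wrong. The report_health helper, the script path, and the cron schedule are illustrative choices, not anything ZFS ships. The Debian packages may already install a periodic scrub job, so look in /etc/cron.d before adding another.

```shell
#!/bin/sh
# Sketch: turn `zpool status -x` output into a one-line verdict.
# report_health is a hypothetical helper; it reads the status text on stdin.
report_health() {
    status=$(cat)
    case "$status" in
        *"is healthy"*|*"all pools are healthy"*) echo "OK" ;;
        *) echo "ATTENTION: $status" ;;
    esac
}

# Only touch the pool if the zpool tool is actually present on this machine.
if command -v zpool >/dev/null 2>&1; then
    zpool scrub tank                          # starts in the background, returns at once
    zpool status -x tank | report_health      # rerun later to see what the scrub found
fi

# Example crontab line (assumed path): scrub at 03:00 on the 1st of each month
# 0 3 1 * * /usr/local/sbin/scrub-tank.sh
```

For the email itself, the zfs-zed daemon installed later in this post is the usual route: it can mail pool events once an address is set in its configuration.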

That is the whole religion. Everything else ZFS does is good, but data integrity is the reason it exists.

Snapshots that cost nothing

ZFS is copy-on-write: it never overwrites a live block, it writes a new one and updates the pointers. A consequence falls out of that for free. A snapshot is just a refusal to free the old blocks. Taking one is instant and consumes no space at the moment you take it; it only grows as the live data diverges from the frozen view. So you can snapshot a dataset every hour, keep a deep history, and roll back to any of them in seconds.
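In practice the hourly-snapshot habit is a couple of commands. The sketch below is a dry run by default: it only prints the zfs commands it would execute. The dataset tank/media is the one created later in this post, while the auto- prefix, the snap_name helper, and the DRY_RUN switch are assumptions made for the example.

```shell
#!/bin/sh
# Dry run unless DRY_RUN=0 is set: print commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

# Hypothetical naming scheme: dataset@auto-YYYY-MM-DD-HHMM (UTC).
snap_name() {
    echo "$1@auto-$(date -u +%Y-%m-%d-%H%M)"
}

snap=$(snap_name tank/media)
run zfs snapshot "$snap"                              # instant, takes no space up front
run zfs list -r -t snapshot -o name,used tank/media   # USED grows as live data diverges
run zfs rollback "$snap"                              # discards changes made since the snapshot
```

A plain zfs rollback only goes back to the most recent snapshot. Rolling back further requires -r, which destroys the newer snapshots along the way, so it deserves a second look before you press enter.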

The same machinery gives you replication. zfs send serialises a snapshot into a stream, and zfs recv reconstructs it on another pool, which can be on another machine over SSH. Send a full snapshot once, then send only the incremental difference between snapshots after that, and you have efficient off-site backup that is consistent by construction, because a snapshot is a single atomic point in time rather than an rsync racing against a live directory. The 0.7 release sharpens exactly this: a send can now ship blocks exactly as they sit compressed on disk, so you save bandwidth and skip the decompress-and-recompress round trip, and an interrupted send can resume from a token instead of starting over, which matters the first time a multi-terabyte replication drops at 80 percent over a flaky link.

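zfs snapshot tank/backups@2017-08-09
# -c sends blocks as they are compressed on disk (needs 0.7 or later)
zfs send -c tank/backups@2017-08-09 | ssh nas2 zfs recv tank/backups
# a week later, take a new snapshot and send only what changed:
zfs snapshot tank/backups@2017-08-16
zfs send -c -i @2017-08-09 tank/backups@2017-08-16 | ssh nas2 zfs recv tank/backups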
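Resuming an interrupted stream could look roughly like this. It assumes both ends run 0.7 or later and reuses the nas2 host and tank/backups dataset from the commands above. The script only prints the commands for each step, the resume_cmd helper is hypothetical, and the token shown is a placeholder, not a real token.

```shell
#!/bin/sh
# Sketch only: prints the commands for each step rather than running them.

# Step 1: receive with -s so a partial stream is kept if the link drops.
echo "zfs send -c tank/backups@2017-08-09 | ssh nas2 zfs recv -s tank/backups"

# Step 2: after an interruption, ask the receiving side for its resume token.
echo "ssh nas2 zfs get -H -o value receive_resume_token tank/backups"

# Step 3: restart from the token. resume_cmd is a hypothetical helper.
resume_cmd() {
    echo "zfs send -t $1 | ssh nas2 zfs recv -s tank/backups"
}
resume_cmd "TOKEN_FROM_STEP_2"
```

The token describes where the stream stopped, so the resumed send names the token and nothing else: no snapshot argument is repeated.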

How a pool is actually built

ZFS folds the volume manager and the filesystem into one thing, so the vocabulary is worth getting straight. Physical disks are grouped into a vdev; vdevs are combined into a pool (zpool); and on top of the pool you create datasets, which behave like filesystems but share the pool's free space and each carry their own properties.

Redundancy lives at the vdev level. A mirror is two or more disks holding the same data, simple and fast to resilver. RAID-Z1, Z2, and Z3 are parity schemes tolerating one, two, or three failed disks per vdev, more space-efficient but slower to rebuild. For a homelab my advice is unfashionable but correct: mirrors for anything you will want to grow or reshape later, RAID-Z2 for bulk storage you will build once and leave alone. Whatever you pick, create the pool with ashift=12 for modern 4K-sector drives, and always reference disks by their /dev/disk/by-id paths, never /dev/sda, because sda is not stable across reboots and you do not want your pool confused about which disk is which.
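The ashift value is the base-2 logarithm of the sector size, so 4096-byte sectors give 12 and old 512-byte sectors give 9. The helper below just does that arithmetic. Its name and the sda device in the usage comment are illustrative, and since some drives report 512 while being 4K internally, setting ashift=12 explicitly is the safer habit.

```shell
#!/bin/sh
# ashift_for is a hypothetical helper: log2 of a power-of-two sector size.
ashift_for() {
    size=$1
    shift_bits=0
    while [ "$size" -gt 1 ]; do
        size=$((size / 2))
        shift_bits=$((shift_bits + 1))
    done
    echo "$shift_bits"
}

ashift_for 512     # prints 9
ashift_for 4096    # prints 12

# On a real box, feed it the drive's reported physical sector size, e.g.:
#   ashift_for "$(cat /sys/block/sda/queue/physical_block_size)"
# and list the stable device names to hand to zpool create with:
#   ls -l /dev/disk/by-id/
```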

Turn on lz4 compression and leave it on. It is fast enough that on most data it costs nothing measurable and frequently makes the pool faster, because the bytes it does not have to read off the disk more than pay for the CPU.

Two things to know going in. First, ZFS is hungry for RAM, because it uses free memory as a read cache called the ARC, and that cache is a large part of why it performs well. On a box doing other work, cap it with the zfs_arc_max module parameter so it does not crowd out everything else. Second, the ECC RAM question. ZFS does not require ECC, and the old horror story about a scrub destroying a pool on non-ECC memory is a myth. But ZFS is a machine for guaranteeing integrity, and it computes and trusts its checksums in RAM. Bad RAM can hand ZFS corruption that it then faithfully checksums and stores as if it were correct. If the entire point of the exercise is that your data is provably intact, putting non-ECC memory under it is undercutting the one thing you came for. Use ECC if the platform allows it.
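Capping the ARC comes down to writing a byte count into a module option. The sketch below computes that line for a cap given in GiB. The 4 GiB figure and the arc_max_line helper are assumptions for illustration, and /etc/modprobe.d/zfs.conf is the conventional file name, not a mandated one.

```shell
#!/bin/sh
# arc_max_line is a hypothetical helper: GiB in, modprobe option line out.
arc_max_line() {
    bytes=$(( $1 * 1024 * 1024 * 1024 ))
    echo "options zfs zfs_arc_max=$bytes"
}

arc_max_line 4    # prints: options zfs zfs_arc_max=4294967296

# To apply it (as root), something along these lines:
#   arc_max_line 4 > /etc/modprobe.d/zfs.conf                   # read when the module next loads
#   echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max    # adjust a running system
```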

Getting it onto Debian Stretch

Now the asterisk. ZFS is licensed under the CDDL, which is a free-software licence, and the Linux kernel is under the GPL, which is also a free-software licence, and the two are widely held to be incompatible, which means ZFS code cannot be merged into the mainline kernel. It lives instead as an out-of-tree module. Canonical decided that shipping that module precompiled in Ubuntu is acceptable, and the Software Freedom Conservancy and the FSF disagree fairly loudly. Debian took the more conservative road: ZFS is in the contrib archive area, as DKMS source rather than a prebuilt binary. That source-only compromise is what let it into the archive at all, which is why Stretch is the first stable release to offer it for Debian's Linux kernel.

Enable contrib, install the headers and the DKMS package, and DKMS compiles the module against your running kernel:

# add contrib to your sources, e.g.:
# deb http://deb.debian.org/debian stretch main contrib
apt update
apt install linux-headers-$(uname -r) zfs-dkms zfsutils-linux zfs-zed
modprobe zfs

The first build takes a few minutes. And here is the recurring tax of out-of-tree DKMS that you must understand before you trust this in production: every time a new kernel lands, DKMS rebuilds the ZFS module against it. If the headers for that kernel are missing, or OpenZFS has not yet caught up to a kernel API change, the build fails and you reboot into a kernel that cannot import your pool. So do not run a blind apt full-upgrade that pulls a new kernel without first checking that the module built. This is the single most common way ZFS-on-Debian bites people, and it is entirely avoidable with a moment of attention at upgrade time.
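One way to make that check routine is to confirm, for every installed kernel, that a zfs module exists before rebooting. The sketch below assumes DKMS places its builds somewhere under /lib/modules/<version>/ with a file name starting with zfs.ko. The has_zfs_module helper is hypothetical, and dkms status remains the authoritative answer.

```shell
#!/bin/sh
# has_zfs_module is a hypothetical helper: succeeds if a zfs module file
# exists anywhere under the given kernel's module directory.
has_zfs_module() {
    [ -n "$(find "$1" -name 'zfs.ko*' 2>/dev/null | head -n 1)" ]
}

for dir in /lib/modules/*/; do
    [ -d "$dir" ] || continue          # glob did not match: no kernels found
    if has_zfs_module "$dir"; then
        echo "ok:      $dir"
    else
        echo "MISSING: $dir  (do not reboot into this kernel yet)"
    fi
done

# Cross-check with DKMS itself where it is installed.
if command -v dkms >/dev/null 2>&1; then
    dkms status
fi
```

Run it after the upgrade and before the reboot. A MISSING line for the newest kernel means you should stay on the old one until the module builds.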

A pool, start to finish

zpool create -o ashift=12 tank mirror \
  /dev/disk/by-id/ata-WDC_WD40EFRX-XXXX \
  /dev/disk/by-id/ata-WDC_WD40EFRX-YYYY
 
zfs set compression=lz4 tank
zfs create tank/media
zfs create tank/backups
 
zpool scrub tank        # verify and self-heal everything
zpool status tank       # health, errors, scrub progress
zfs list -t snapshot    # what you can roll back to

That is a self-checking, self-healing, snapshot-capable, replicable storage pool in seven commands, and only the first four are needed to build it.

Bottom line

ZFS is heavy. It wants RAM, it wants ECC, it wants you to think about vdevs before you create them because you cannot trivially reshape them afterward, and on Linux it drags a genuine licensing argument and a DKMS rebuild into your life. None of that is free.

What you get for it is one of the very few filesystems in common reach that will tell you the truth about whether your data is still the data you wrote. ext4 will lie to you politely for years. ZFS scrubs the array, finds the three blocks that rotted, fixes them from redundancy, and sends you a clean report. For media you could re-download, that is a luxury. For backups, family photos, and anything irreplaceable, it is not a close call. Put the things you cannot lose on a filesystem that refuses to lose them quietly.

Tagged:
#zfs #storage #linux #debian #backup