Some notes on our new generation of ZFS-based file servers

Consistently among the most popular posts on this blog is a series I wrote very early on about our architecture (at work) for big file servers based on commodity hardware, FreeBSD, and ZFS (part 1, part 2). We are in the process of replacing our older generation of servers with newer technology, taking advantage of the increase in disk capacity to move from 96-drive servers that take half a rack down to 4U, 24-drive servers. This does have some consequences for capacity and performance (although the increased memory in the new servers should hopefully mitigate the performance concerns). I’m presently engaged in copying the data off one of the old servers (using zfs send | zfs recv) to new servers so that we can unrack the old server and make room for two more new servers. One of the biggest datasets just finished its initial copy today, and I wrote a few things for my colleagues about this process in our office Jabber conference. The rest of this post is a lightly edited version of that.
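
The bulk copies are just zfs send piped over ssh into zfs recv, starting from one of our existing automatic snapshots; something along these lines (pool, host, and snapshot names here are just placeholders):

# initial full copy of one dataset, starting from an existing automatic snapshot
$ zfs send -R oldpool/vision/dataset@hourly.2016-01-04.00 | \
      ssh newserver zfs recv -u -d -F newpool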

$ df -i /export/vision/[redacted]
Filesystem                 1K-blocks        Used      Avail Capacity   iused       ifree %iused  Mounted on
export/vision/[redacted] 28926547412 23434396266 5492151146    81% 527217855 10984302292    5%   /export/vision/[redacted]

This filesystem might finally be fully moved on Monday, after four weeks of effort.
[It finished today after three previous attempts that all bailed out when the dataset exceeded quota on the new server; the following explains why. Another sync will be required on Monday to catch up with any updates the users may have made.]

527 million files, at an average of about 44 KiB of allocated space per file. Moving this from a filesystem with 512-byte sectors and a 3072-byte full RAID stripe to one with 4096-byte sectors and a 20480-byte full stripe expands the data by about 5 TiB.
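
(The 44 KiB figure falls straight out of the df output above: allocated space divided by inodes in use.)

$ df -i /export/vision/[redacted] | awk 'NR==2 { printf "%.1f KiB per file\n", $3/$6 }'
44.4 KiB per file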

FWIW, the capacity of the new servers in RAID-Z2 (3×7) is 98 TiB, whereas RAID-1 (11×2) is 77 TiB. I’m wondering whether I should configure one of the new servers as 2×11 RAID-Z2 just to see how that works out (higher fragmentation for these vision filesystems, and slower, but theoretically higher capacity).

For those who are a bit confused by this: RAID-Z is not like RAID-5, and doesn’t have a fixed stripe size; to avoid the “write hole” and the read-modify-write cycles needed to fill out a partial stripe, RAID-Z writes short stripes with full parity if there’s not enough data to make a full stripe. In the worst case, RAID-Z2 will write two parity blocks for a single data block if that’s all there is to be written.

The old servers have 88 active drives, arranged in 11 stripes of 8 drives each with RAID-Z2, but the drives have 512-byte sectors, so a “full stripe” is 6×512 = 3072 bytes plus parity; the new servers as currently conceived have 21 active drives in 3 stripes of 7, again with double parity, but 4k sectors mean that a full stripe is 5×4096 = 20480 bytes. [The new servers have 24 drive bays, leaving 23 available for data after the ZIL, but we built them with 22 drives including a hot spare.] So if the average write size (after compression) is 8k or less, you’re better off with mirroring than with RAID-Z2.
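
To make the allocation math concrete, here is a rough back-of-the-envelope sketch for the new 7-wide RAID-Z2 layout with 4 KiB sectors, compared with 2-way mirrors, ignoring compression and the allocator’s round-up padding (which only makes RAID-Z2 look worse for small writes):

$ awk 'BEGIN {
      sector = 4096;                            # 4k-sector drives
      for (rec = 4096; rec <= 131072; rec *= 2) {
          d = int((rec + sector - 1) / sector); # data sectors needed
          p = 2 * int((d + 4) / 5);             # RAID-Z2: 2 parity sectors per stripe of up to 5 data columns
          printf "%7d-byte record: raidz2 %2d sectors, mirror %2d sectors\n", rec, d + p, 2 * d;
      }
  }'
   4096-byte record: raidz2  3 sectors, mirror  2 sectors
   8192-byte record: raidz2  4 sectors, mirror  4 sectors
  16384-byte record: raidz2  6 sectors, mirror  8 sectors
  32768-byte record: raidz2 12 sectors, mirror 16 sectors
  65536-byte record: raidz2 24 sectors, mirror 32 sectors
 131072-byte record: raidz2 46 sectors, mirror 64 sectors

Below about 8 KiB the mirror allocates the same space or less, which is where the 8k rule of thumb comes from; above that, RAID-Z2 pulls ahead.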

A bit more about our ZFS dataset migration process. These filesystems are all NFS-exported read-write to the client systems. Most of the data movement takes place without user involvement, taking advantage of our existing hourly, daily, and monthly snapshots to make efficient, self-consistent copies of each filesystem. Once this process is complete, we coordinate downtime with the users. At the beginning of the downtime window, we set readonly=on on each dataset being migrated and wait for the next hourly snapshot. (Doing it this way ensures that old manual snapshots are not accidentally left hanging around consuming disk space.) We then replicate that snapshot to the new server and clear the readonly property on the destination datasets with zfs inherit. Updated automount maps pointing to the new server are pushed to our Puppet masters, and if the users actually followed instructions, all of the client machines will get updated within half an hour. (Sometimes the users don’t follow instructions and the old server has to be force-unmounted to allow the automounter to mount the new location.) Finally, if we are backing up the filesystems, the backup server is updated to reflect the migration. (There’s no need to take a new full backup, because ZFS replication preserves inode numbers and timestamps.)
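
Concretely, the cutover for a single dataset looks something like this (host, pool, and snapshot names are again placeholders):

# on the old server: freeze the dataset, then wait for the next hourly snapshot
$ zfs set readonly=on oldpool/vision/dataset
# send everything since the previous sync to the new server
$ zfs send -R -I @hourly.2016-01-11.09 oldpool/vision/dataset@hourly.2016-01-11.10 | \
      ssh newserver zfs recv -u -d -F newpool
# on the new server: make the copy writable again
$ ssh newserver zfs inherit readonly newpool/vision/dataset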
