Occasionally Coherent

Building big NFS servers with FreeBSD/ZFS (2 of 2)

At CSAIL, we support two distributed file systems. AFS has a sensible security model (if somewhat dated in its implementation), but is slow and limited in storage capacity. NFS is for people who don’t want security and do need either speed or high capacity (tens of terabytes in the worst case, as compared with hundreds of gigabytes for AFS). Historically, AFS has also been much cheaper and has scaled out much better: NFS required expensive dedicated hardware to perform reasonably well, whereas AFS could run on J. Random Server (these days, Dell PowerEdge rackmounts, because they have great rails and a usable [SSH, not IPMI] remote console interface) with internal drives, and when we needed more capacity, we could just buy another server and fill it with disks. Of course, the flip side was that AFS just couldn’t (and still can’t) scale up in terms of IOPS, notwithstanding the benefit it should get from client-side caching; the current OpenAFS fileserver is said to have internal bottlenecks that prevent it from using more than about eight threads effectively. So in this era of “big data”, lots of people want NFS storage for their research data, and AFS is relegated to smaller shared services with more limited performance requirements, like individual home directories, Web sites, and source-code repositories.

In part 1, and previously on my original Web page, I described a configuration for a big file server, several of which we have deployed at work. (Well, for values of “several” equal to “three”, or “five” if you include the mirror-image hardware that we have installed as a backup but have not finished deploying just yet.) One of the research groups we support wanted more temporary space than we would be able to allocate on our shared servers, and they were willing to pay for us to build a new server for their exclusive use. I asked iXsystems for a budgetary quote, and we’re actually going forward with the purchasing process now.

If you read my description from January, you’ll recall that we have 96 disks in this configuration, but four of them are SSDs (used for the ZFS intent log and L2ARC), and another four (one for each drive shelf) are hot spares. Thus, there are 88 disks actually used for storage. On our scratch server, we have these configured in mirror pairs, and on the other servers, we are using 8-disk RAID-Z2. In both cases, the vdevs are spread across disk shelves, so that we can survive even a failure of an entire shelf. That gives us the following capacities, after accounting for overhead:

Storage array usable capacity
Configuration   Drive size   Aggregate   Usable
mirror          2 TB         80 TiB      76.1 TiB
mirror          3 TB         120 TiB     114.2 TiB
RAID-Z2         2 TB         160 TiB     117.4 TiB
RAID-Z2         3 TB         239 TiB     176 TiB
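
For anyone who wants to check the arithmetic, here is a rough sketch of where the “Aggregate” column seems to come from. It assumes the 88 data disks described above, decimal terabytes for drive sizes, and that the RAID-Z2 aggregate counts raw capacity including parity (that is my reading of the numbers, not something stated outright). The function name and layout labels are made up for illustration. The “Usable” column depends on overheads that aren’t broken out here, so it is not modeled.

```python
# Rough reconstruction of the "Aggregate" column (illustrative only).
TB = 10**12    # drive vendors use decimal terabytes
TIB = 2**40    # binary tebibytes, as used in the table
DATA_DISKS = 88  # 96 disks minus 4 SSDs and 4 hot spares


def aggregate_tib(layout: str, drive_tb: int, disks: int = DATA_DISKS) -> float:
    """Return the aggregate capacity in TiB for a given layout.

    Assumption: mirrors count one disk per pair, while RAID-Z2 counts
    every disk (parity included), which is what the table appears to match.
    """
    if layout == "mirror":
        raw_bytes = (disks // 2) * drive_tb * TB
    elif layout == "raidz2":
        raw_bytes = disks * drive_tb * TB
    else:
        raise ValueError(f"unknown layout: {layout}")
    return raw_bytes / TIB


if __name__ == "__main__":
    for layout, drive_tb in [("mirror", 2), ("mirror", 3), ("raidz2", 2), ("raidz2", 3)]:
        print(f"{layout:7s} {drive_tb} TB  {aggregate_tib(layout, drive_tb):6.1f} TiB")
```

Under these assumptions the first three rows come out at roughly 80, 120, and 160 TiB, matching the table; the last comes out at about 240 TiB rather than 239, so the real figure presumably reflects the drives’ exact sizes.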

The quote that we got from iXsystems was for a system with 3 TB disks, and with more memory and faster processors than the servers that Quanta donated in 2012. All told, with a three-year warranty, it comes in at under $60,000. For this group, we’ll be deploying a “scratch” (mirrored) configuration, so that works out to be under $512/TiB, which is amazingly good for a high-speed file server with buckets of SSD cache. That’s about 47 cents per TiB-day, assuming a useful life of three years; in reality we usually get closer to five years. (Of course, that does not include the cost of the rack space, network connectivity, power, and cooling, all of which are sunk costs for us.) In the “production” (RAID-Z2) configuration, the cost looks even better: $341/TiB or 31 cents/TiB*d. (Of course, we’d like to have a complete mirror image of a production system, which would double that price.)
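
The per-day figures are just the per-TiB cost spread over the service life. A minimal sketch of that conversion, assuming 365-day years and taking the per-TiB prices quoted above as inputs (the function name is mine, purely for illustration):

```python
# Convert a purchase cost per TiB into cents per TiB-day (illustrative only).
DAYS_PER_YEAR = 365  # assumption: leap days are ignored


def cents_per_tib_day(dollars_per_tib: float, life_years: float) -> float:
    """Spread a one-time cost per TiB evenly over the useful life."""
    if life_years <= 0:
        raise ValueError("life_years must be positive")
    return dollars_per_tib * 100 / (life_years * DAYS_PER_YEAR)


if __name__ == "__main__":
    for label, price in [("scratch (mirror)", 512), ("production (RAID-Z2)", 341)]:
        for years in (3, 5):
            cents = cents_per_tib_day(price, years)
            print(f"{label:22s} {years} years: {cents:4.1f} cents/TiB-day")
```

At three years this gives about 47 and 31 cents; stretching the life to five years brings the scratch configuration down to roughly 28 cents per TiB-day.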

This raises an interesting question: at what point, if at all, does it make sense to build our AFS servers around a similar hardware architecture? Given the OpenAFS scaling constraints, might it even make sense to export zvols to the AFS servers over iSCSI? A fairly random Dell R620 configuration I grabbed from our Dell Premier portal (not the model or configuration we would normally buy for an AFS server, but an easy reference) comes in at nearly $960/TiB! (Nearly 88 cents per TiB*d.) Because of various brokenness in the Dell portal, I wasn’t able to look at servers with 3.5″ drive bays, which would significantly reduce the price — but not down to $341/TiB. I think the only way to get it down that low with Dell hardware is to buy a minimal-disk server with a bunch of empty drive bays, then buy salvage Dell drive caddies (thankfully they haven’t changed the design much) and fill the empty slots with third-party drives. Even if you do that, however, I think you still can’t amortize the cost of the host over enough drives to make it competitive on a per-TiB basis.

For now, we’ll be sticking with our existing AFS servers, but this will be a matter to consider seriously when we have our next replacement cycle.