At CSAIL, we support two distributed file systems: AFS, which has a sensible security model (if a somewhat dated implementation) but is slow and limited in storage capacity, and NFS, for people who don’t want security and do need either speed or high capacity (tens of terabytes in the worst case, as compared with hundreds of gigabytes for AFS). Historically, AFS has also been much cheaper and scaled out much better: NFS required expensive dedicated hardware to perform reasonably well, whereas AFS could run on J. Random Server (these days, Dell PowerEdge rackmounts, because they have great rails and a usable [SSH, not IPMI] remote console interface) with internal drives, and when we needed more capacity, we could just buy another server and fill it with disks. Of course, the converse is that AFS just couldn’t (and still can’t) scale up in terms of IOPS, notwithstanding the benefit it should get from client-side caching; the current OpenAFS fileserver is said to have internal bottlenecks that prevent it from using more than about eight threads effectively. So in this era of “big data”, lots of people want NFS storage for their research data, and AFS is relegated to smaller shared services with more limited performance requirements, like individual home directories, Web sites, and source-code repositories.
In part 1, and previously on my original Web page, I described a configuration for a big file server that we have deployed several of at work. (Well, for values of “several” equal to “three”, or “five” if you include the mirror-image hardware we have installed but not finished deploying just yet as a backup.) One of the research groups we support wanted more temporary space than we would be able to allocate on our shared servers, and they were willing to pay for us to build a new server for their exclusive use. I asked iXsystems for a budgetary quote, and we’re actually going forward with the purchasing process now.
If you read my description from January, you’ll recall that we have 96 disks in this configuration, but four of them are SSDs (used for the ZFS intent log and L2ARC), and another four (one per drive shelf) are hot spares. Thus, there are 88 disks actually used for storage. On our scratch server, we have these configured in mirror pairs, and on the other servers, we are using 8-disk RAID-Z2 (in both cases, the vdevs are spread across the disk shelves, so that we can survive even the failure of an entire shelf). That gives us the following capacities, after accounting for overhead:
|Layout|Drive size|Raw capacity|Usable capacity|
|---|---|---|---|
|mirror|2 TB|80 TiB|76.1 TiB|
|mirror|3 TB|120 TiB|114.2 TiB|
|RAID-Z2|2 TB|160 TiB|117.4 TiB|
|RAID-Z2|3 TB|239 TiB|176 TiB|
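As a rough sanity check, the left-hand capacity column falls out of the disk counts directly, remembering that drives are sold in decimal terabytes while ZFS reports binary TiB. This is a back-of-the-envelope sketch, not the exact ZFS accounting — the “usable” column additionally reflects parity and metadata overhead that this ignores:

```python
# Back-of-the-envelope check of the capacity table above.
# Assumptions: 88 data disks; mirrors pair them (44 vdevs), so half the
# raw space remains; the RAID-Z2 figure here is total raw space before
# parity. Drives are decimal TB; ZFS reports binary TiB.
TB = 10**12
TiB = 2**40
DISKS = 88

def mirror_tib(tb_per_disk):
    """Capacity of 44 mirror pairs, before ZFS metadata overhead."""
    return DISKS // 2 * tb_per_disk * TB / TiB

def raidz2_raw_tib(tb_per_disk):
    """Total raw capacity across all 88 disks, parity not yet deducted."""
    return DISKS * tb_per_disk * TB / TiB

for size in (2, 3):
    print(f"mirror,  {size} TB disks: {mirror_tib(size):6.1f} TiB")
    print(f"RAID-Z2, {size} TB disks: {raidz2_raw_tib(size):6.1f} TiB raw")
```

The results land within a TiB or so of the table (e.g. 240.1 TiB computed vs. 239 TiB listed for the 3 TB RAID-Z2 case); the small differences come from reserved space and rounding in the real pools.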
The quote that we got from iXsystems was for a system with more memory and faster processors than the servers that Quanta donated in 2012, and with 3 TB disks. All told, with a three-year warranty, it comes in at under $60,000. For this group, we’ll be deploying a “scratch” (mirrored) configuration, so that works out to be under $512/TiB, which is amazingly good for a high-speed file server with buckets of SSD cache. That’s about 47 cents per terabyte-day, assuming a useful life of three years, and in reality we usually get closer to five years. (Of course, that does not include the cost of the rack space, network connectivity, power, and cooling, all of which are sunk costs for us.) In the “production” (RAID-Z2) configuration, the cost looks even better: $341/TiB or 31 cents/TiB*d. (Of course, we’d like to have a complete mirror image of a production system, which would double that price.)
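For anyone who wants to reproduce the amortization arithmetic, here is a sketch. The round $60,000 figure and the usable capacities come from the post; note that using the full $60,000 (rather than the actual, slightly lower quote) overshoots the mirrored figure a little:

```python
# Sketch of the cost arithmetic above. PRICE is the approximate quoted
# price; the usable capacities are the 3 TB rows of the capacity table.
# Three years is the stated amortization assumption.
PRICE = 60_000           # USD, "under $60,000"
DAYS = 3 * 365           # assumed useful life

def cost_per_tib(usable_tib):
    return PRICE / usable_tib

def cents_per_tib_day(usable_tib):
    return 100 * PRICE / (usable_tib * DAYS)

for name, tib in (("scratch (mirror)", 114.2), ("production (RAID-Z2)", 176.0)):
    print(f"{name}: ${cost_per_tib(tib):.0f}/TiB, "
          f"{cents_per_tib_day(tib):.1f} cents/TiB*day")
```

The RAID-Z2 case comes out to $341/TiB and about 31 cents/TiB*day, matching the figures above.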
This raises an interesting question: at what point, if at all, does it make sense to build our AFS servers around a similar hardware architecture? Given the OpenAFS scaling constraints, might it even make sense to export zvols to the AFS servers over iSCSI? A fairly random Dell R620 configuration I grabbed from our Dell Premier portal (not the model or configuration we would normally buy for an AFS server, but an easy reference) comes in at nearly $960/TiB! (Nearly 88 cents per TiB*d.) Because of various brokenness in the Dell portal, I wasn’t able to look at servers with 3.5″ drive bays, which would significantly reduce the price — but not down to $341/TiB. I think the only way to get it down that low with Dell hardware is to buy a minimal-disk server with a bunch of empty drive bays, then buy salvage Dell drive caddies (thankfully they haven’t changed the design much) and fill the empty slots with third-party drives. Even if you do that, however, I think you still can’t amortize the cost of the host over enough drives to make it competitive on a per-TiB basis.
For now, we’ll be sticking with our existing AFS servers, but this will be a matter to consider seriously when we have our next replacement cycle.
I’d be interested to know if you’ve looked at distributed storage solutions, like Ceph or Gluster?
We have looked at them. Unfortunately, most of them were not obviously designed by someone who had a clue about security (a minimum requirement, if we were to move away from NFS, would be a security model at least no worse than what we have now, and some of them are worse), and there are various other issues surrounding vendor lock-in, support for administrative tools, and overall maturity that have made us uninterested in the current state of these systems. A particular sore spot for us is that many of these distributed filesystems assume much tighter administrative integration than I can ever guarantee (in a research lab with 600 graduate students, all of whom have root on their machines); it’s difficult enough dealing with the UID/GID issue across five different platforms (managed Debian/Ubuntu, unmanaged Ubuntu, Mac OS, Windows, and FreeBSD).
Actually, we (I) have looked briefly at Ceph, and it’s very interesting, with a multi-tenant security model. Though I wish it were actually Kerberos rather than “Kerberos-like”.
In the Ceph case, the biggest blocker is finding the time to really test and evaluate something so novel to our environment. I’ve had a couple of Ceph pools for testing at various times, but something always pulls me away before I’m done.
Given that caveat, when I was looking, the filesystem component was known to be slow because it had yet to receive any real optimization work; that may be different now. The object store and RBD interfaces were (and are) interesting for backending our OpenStack deploy, but they are sufficiently different from a network filesystem that they really wouldn’t be a replacement for our existing user base.