Over the past couple of years, I’ve had the chance to build a number of big file servers for work. Back in January, I wrote a description of the server hardware we used and some of the software configuration that was required to make it go. Since we’re in the process of actually buying one of these things for the first time (the initial hardware was donated), I figured it was time for an update.
When I first built these servers, I used FreeBSD 9.1 as the base operating system, primarily due to the combination of familiarity and ZFS support. ZFS is a big win for servers on this scale. I had hoped that FreeBSD 9.2 would be out by the summer, and we could test and deploy a new release fairly easily, but that still hasn’t happened; summer break ended three weeks ago, and with it my opportunity to test an updated software stack. As it turns out, most of the stuff that we might have cared about from 9.2 is already in my patched 9.1, and some of the patches that matter the most didn’t make it into the 9.2 release cycle at all.
The servers are all Puppetized, although some issues with Puppet have kept me from managing as much of the configuration that way as I would have liked, and support for FreeBSD in non-core Puppet modules is still very limited. (Many Puppet modules that I’ve run across also have conceptual or data-model problems that limit their portability.)
One issue we discovered fairly early on was with the driver for the Intel 10-Gbit/s Ethernet controller. This turned out to be a misunderstanding on the part of the Intel people over memory allocation. (Specifically, for an interface with jumbo frames configured, as all but one of our servers have, they would try to allocate a “16k jumbo” buffer, which requires four contiguous pages of memory. The controller’s DMA engine has no problem doing scatter-gather across multiple physical pages, so the right thing — and the fix that I applied — was simply never to allocate more than a page-sized buffer, which will always be possible whenever there is any memory available at all.) Debugging this issue at least got me to write a munin plugin for tracking kernel memory allocations, which has proved to be useful more generally.
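For the curious, a munin plugin is just a small script that prints a configuration stanza when asked and metric values otherwise. Here is a minimal sketch in that style — not the actual plugin, and the particular sysctls and field names are my own choices for illustration (it graphs a few aggregate kernel-memory figures rather than per-type allocations):

    #!/bin/sh
    # Minimal munin plugin sketch: graph FreeBSD kernel memory use.
    # Illustrative only; the sysctls and field names here are assumptions,
    # not the plugin described in the text.
    case "$1" in
    config)
        echo "graph_title Kernel memory"
        echo "graph_vlabel bytes"
        echo "graph_category system"
        echo "kmem_size.label kmem map size"
        echo "kmem_used.label kmem map used"
        echo "arc_size.label ZFS ARC size"
        ;;
    *)
        echo "kmem_size.value $(sysctl -n vm.kmem_size)"
        echo "kmem_used.value $(sysctl -n vm.kmem_map_size)"
        echo "arc_size.value $(sysctl -n kstat.zfs.misc.arcstats.size)"
        ;;
    esac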
Once I fixed the ixgbe driver, the next issue we ran into was the fact that 96 GB of RAM just isn’t quite enough for a big ZFS file server, at least on FreeBSD. This is due in large part to the way ZFS manages memory: it requires wired kernel virtual memory for the entire Adaptive Replacement Cache (ARC), and while it tries to grow and shrink its allocation in response to demand, it often doesn’t shrink (or doesn’t shrink quickly enough) to avoid memory-exhaustion deadlock. (By contrast, the standard FreeBSD block cache is unified at a low level with the VM system, and stores cached data in unmapped — physical — pages, rather than wired virtual pages.) We found that our nightly backup jobs were particularly painful, as the backup system traversed large amounts of otherwise cold metadata in some very large filesystems. We ended up limiting the ARC to 64 GB, which leaves just enough memory for NFS and userland applications.
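The cap itself is a one-line loader tunable; a sketch (68719476736 is 64 GiB expressed in bytes — the right ceiling obviously depends on how much RAM the machine has and what else it runs):

    # Cap the ZFS ARC at 64 GiB (value in bytes); takes effect at the next boot.
    echo 'vfs.zfs.arc_max="68719476736"' >> /boot/loader.conf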
gmultipath is a bit of a sore point. It does exactly what it is supposed to in the case of an actual path failure — I tested this under load by pulling a SAS cable — but it does totally the wrong thing when the hardware reports a medium error. gmultipath appears to have no way to distinguish these errors (it may be implemented at the wrong layer to be able to do so), so it just continually retries the request, alternating paths, until someone notices that the server is really slow and checks out the console message buffer to see what went wrong. At least it does allow us to label the drive so we know which one is bad, but it would be better if it were built into CAM, which actually has sufficient topology information to do it right and can distinguish between path and medium errors before both get turned into [EIO] heading into GEOM. (This is particularly bad for disks that default to retrying every failed request for 30 seconds or more — and even modern “enterprise” drives seem to come that way.)
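The labeling step, at least, is straightforward: gmultipath writes its metadata into the last sector of the disk, and the second path is attached automatically when GEOM tastes it. Something like the following, where the label and device name are purely illustrative:

    # Label a dual-ported SAS disk; the other path to the same disk will be
    # picked up automatically once its metadata is tasted. Names are made up.
    gmultipath label enc1slot4 /dev/da4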
The overall performance of the system is quite good, although not yet as good as I would like. For data that fits in the ARC, the tweaked NFS server (with Rick Macklem’s patches) can do close to line rate on the 10-Gbit/s interface, which is more than any of our clients (most of which are limited to 1 Gbit/s) can consume. Operations that hit disk are still a bit slower than I think they should be, but the bottleneck is clearly on the host side: the disks themselves are loafing. I’m guessing that there are still some kernel bottlenecks to be addressed even after fixing the NFS server’s replay cache.
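(One easy way to see this sort of thing directly is to watch gstat on the server while a test runs; low %busy on the disks while throughput is capped points at a host-side bottleneck rather than the spindles.)

    # Watch per-provider I/O rates and %busy, refreshing once per second.
    gstat -I 1s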
In part 2, I’ll look at how much it costs to build one of these things.