A few hours ago, I sent the following “dear colleagues” email (lightly edited to remove some private details) to all my users at work:
This has been a very trying week. For those of you whose work was disrupted this week by unplanned network outages, my deepest apologies. I am writing to you to explain what we know so far about the cause of these problems, what we have done to resolve them, and what actions still remain to be taken.
Over the course of the summer, we have been suffering from a variety of difficult-to-identify network problems. Our network switches have been observed using far more CPU than has historically been the case; we have had a series of packet storms that appear to have been caused by forwarding loops, despite the fact that we run a protocol designed to prevent such loops from forming; and we have had a number of unexplained switch crashes.
Starting very early Wednesday morning, the switch that serves the server room 32-399 began to crash unexpectedly, and in a way that (contrary to design) requires a human to physically power-cycle it in order to restore service. This switch crash affected a variety of research groups’ systems, as well as central AFS servers and the newest set of OpenStack hypervisors. We initially thought that this was a consequence of the power failure we experienced on Tuesday evening, and the vendor (Juniper Networks) suggested a reinstallation procedure for the member nodes in this switch to recover from possible corruption of the flash media from which the switch member nodes boot. Unfortunately, this did not resolve the problem, although it did put off the recurrence for a few hours.
On Thursday morning, an engineer from Juniper came to Stata to help us determine the cause of that crash and come up with a resolution. He was assisted by five other Juniper engineers from their Advanced Technical Assistance Center via teleconference. It took the rest of the day Thursday to come up with an action plan to resolve the issue (although still without any identifiable root cause), because the failing switch nodes were unresponsive on their console as well as the network, and none of the more obvious fixes we tried had any effect. (The switch logs, on reboot, stopped just before the crash.) Eventually we decided to upgrade this one switch to the latest version of Juniper’s firmware for this platform (EX4200), and simultaneously to make several configuration changes to reduce the use of code paths which are less well tested. This, in combination with what I’m about to explain next, appears to have resolved the issues with this particular switch. I will be monitoring this switch over the weekend to make sure it remains stable. Juniper has dispatched replacement hardware for all three member nodes of this switch, in case it proves necessary, but at this point we believe that the problem was caused by software and not a hardware failure (and thus, the association with the power outage was a red herring).
Over the summer we have been experiencing a variety of issues with our core switches. We continued to experience these issues even after upgrading the firmware as recommended by Juniper, which we did (on an emergency, unscheduled basis) two Mondays ago. The most serious issue is that bridge loops occasionally form and saturate the CPU on the core switches, thereby preventing them from running necessary protocol processing. I believe (but don’t have official confirmation yet) that this can result in a priority inversion: the flooding of broadcast and multicast traffic through the network starves the process that implements the Spanning-Tree Protocol (which constructs a loop-free topology by deactivating redundant links). When that process cannot run, all of the access switches conclude that their uplink is no longer connected to a spanning-tree bridge, and Sorcerer’s Apprentice packet amplification follows as multicasts are then forwarded down both core links simultaneously, which only adds to the overload on the core switches. Sometimes the core switches would recover from this condition without human intervention, but several times they did not, and I was forced to physically power-cycle one of them to break the loop.
We are trying to develop configuration changes that will make this less likely in the future, by changing some of the assumptions in the Spanning-Tree Protocol to limit the possibility of bridge loops forming in the first place. That work has yet to be done, so in the meantime we have made some changes to our network to reduce the CPU load on the core switches and make it less likely that the spanning-tree process will be starved.
The principal change that I have made in this regard is to disable IPv6 on most CSAIL networks. I have come to the conclusion that so much in IPv6 design and implementation has been botched by protocol designers and vendors (both ours and others) that it is simply unsafe to run IPv6 on a production network except in very limited geographical circumstances and with very tight central administration of hosts.
The fundamental design problem with IPv6 is related to how it functions over shared layer-2 networks like Ethernet. In IPv4, there is a separate protocol (ARP) which is used by hosts to find the MAC address of other stations on the network. To do this, a host that wants to send a packet sends a layer-2 broadcast frame asking “who has IP address 128.30.xxx.yyy?” The intended recipient, if it’s on the network, sends a reply that says “Hey, I’m 128.30.xxx.yyy, and my MAC address (in case you missed it) is 01:23:45:67:89:ab!” If the intended recipient isn’t on the network, the sender keeps on broadcasting periodically until it either gets a response or gives up.
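To make that exchange concrete, here is a sketch (purely illustrative; the field layout is from RFC 826, and the MAC and IP values are made up) of the broadcast “who-has” frame a host emits:

```python
def arp_request(sender_mac: bytes, sender_ip: bytes, target_ip: bytes) -> bytes:
    """Build the Ethernet broadcast 'who-has' ARP frame described above."""
    eth_header = (b"\xff\xff\xff\xff\xff\xff"  # destination: Ethernet broadcast
                  + sender_mac                  # source MAC
                  + b"\x08\x06")                # EtherType: ARP
    arp_body = (b"\x00\x01"                     # hardware type: Ethernet
                + b"\x08\x00"                   # protocol type: IPv4
                + b"\x06\x04"                   # MAC and IP address lengths
                + b"\x00\x01"                   # opcode 1: request ("who has?")
                + sender_mac + sender_ip
                + b"\x00" * 6                   # target MAC: unknown, all zeros
                + target_ip)
    return eth_header + arp_body

frame = arp_request(bytes.fromhex("0123456789ab"),
                    bytes([128, 30, 0, 1]),     # hypothetical sender IP
                    bytes([128, 30, 0, 2]))     # hypothetical target IP
assert len(frame) == 42  # 14-byte Ethernet header + 28-byte ARP payload
```

The point to notice is the destination: a single layer-2 broadcast that every switch can flood in hardware without inspecting it.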
In IPv6, this is not a separate protocol; it’s called “neighbor discovery”, and uses ICMPv6 packets. Because IPv6 does not have broadcasts, the solicitation packets must be sent to a multicast address. But rather than use the standard “all hosts” multicast address, the Neighbor Discovery Protocol instead specifies that every host must join another multicast group, one of 4 billion distinct “solicited node” multicast groups chosen as a function of the system’s IPv6 address, and Neighbor Discovery packets are sent to this group rather than being broadcast. (This means that the vast majority of all IPv6 multicast groups in use anywhere in the universe have a single member.)
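The “function of the system’s IPv6 address” is simple: per RFC 4291, the low 24 bits of the unicast address are appended to a fixed prefix. A minimal sketch (the example addresses are made up):

```python
import ipaddress

def solicited_node_group(addr: str) -> ipaddress.IPv6Address:
    """Map a unicast IPv6 address to its solicited-node multicast group
    (RFC 4291): the ff02::1:ff00:0/104 prefix plus the address's low 24 bits."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)

print(solicited_node_group("2001:db8::4567:89ab:cdef"))  # ff02::1:ffab:cdef
print(solicited_node_group("fe80::1"))                   # ff02::1:ff00:1
```

Since only 24 bits of the address select the group, two hosts share a group only if their addresses happen to collide in those bits, which is why nearly every group has exactly one member.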
In theory, that should be no worse than sending an ARP broadcast, but in practice it is much worse, because IPv6 systems must also implement the Multicast Listener Discovery protocol, by which all stations on the network report, when queried, all of the multicast groups they are members of. They send these reports to the group itself, which means flooding them throughout the network, because the switches have no way of knowing which Ethernet multicast addresses are wanted on which ports. Furthermore, the protocol requires MLD packets to carry a “router alert” option, which causes routers to bounce these packets from hardware forwarding into software: while the flooding of ARP broadcasts can be fast-pathed and is usually implemented in silicon, MLD listener report multicasts must be slow-pathed. And since IPv6 is still not deployed on very large layer-2 networks like a campus network or our building, MLD processing is generally poorly tested and not optimized by network vendors. Our core switches send MLD all-groups queries every 150 seconds, and direct end hosts to splay their responses over a 10-second interval, both as recommended in the MLD RFC.
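Those two numbers determine how hard a single host can hit the slow path. A back-of-the-envelope sketch, using the query interval and splay window quoted above:

```python
QUERY_INTERVAL_S = 150   # all-groups query interval on our core switches
RESPONSE_SPLAY_S = 10    # window over which hosts splay their reports

def peak_report_rate(groups_joined: int) -> float:
    """Worst-case MLD listener-report packets per second from one host
    answering an all-groups query within the splay window."""
    return groups_joined / RESPONSE_SPLAY_S

# A well-behaved host with one address joins two groups (its
# solicited-node group plus all-nodes); a pathological host with
# hundreds of addresses answers for hundreds of groups.
print(peak_report_rate(2))    # 0.2 packets/s
print(peak_report_rate(300))  # 30.0 packets/s, every one flooded network-wide
```

Multiply the per-host rate by the number of hosts on the segment, and remember that every one of those packets is processed in software on the routers.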
In theory, we could implement “MLD snooping” across our network to reduce the overhead of flooding MLD listener report packets all over. However, this is very new and raw code, at least in Juniper’s operating system, and not well exercised anywhere that I’m aware of. Even if we did this (and we tried), that would require at least two entries in the multicast zone of every switch’s TCAM (hardware forwarding table) for every host on every IPv6-enabled network in CSAIL, just to handle the Neighbor Discovery multicast groups — and in our switches, the entire TCAM space dedicated to IPv6 multicast is only 3,000 entries. That would be just barely enough to support all of CSAIL, but for one major issue: “privacy” addresses.
IPv6 “privacy” addresses are an incredible botch added to IPv6 a few years ago to simulate the “good old days” of dial-up Internet, where machines changed their IP(v4) addresses all the time. In normal IPv6 usage, every host generates an IPv6 address by listening for periodic multicast advertisements from routers telling it what network prefix (the first 64 bits of the IPv6 address) to use, and appending to that a fixed function of its network interface card’s MAC address. With “privacy” addresses, the host simply generates a 48-bit pseudorandom number, pretends that it’s a MAC address, and applies the same function. The host will also maintain a “traditional” IPv6 address, used only for incoming connections; the random address is used for outgoing packets. What’s worse, the random address is changed regularly, typically daily, but the old random addresses are kept around for a fairly long time, on the order of a week, in case some other host out there on the network wants to “call back” after a new address has been generated. Thus, a typical machine — say, an Ubuntu 14.04 workstation with the default configuration — will end up claiming eight IPv6 addresses at any one time. That means nine IPv6 multicast groups (one solicited-node group per address, plus the all-nodes group), which means that those 3,000 TCAM entries can be exhausted by as few as 333 Ubuntu workstations.
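The “fixed function” in question is the modified EUI-64 construction from RFC 4291 Appendix A. A sketch, using a made-up prefix and MAC address; a privacy address applies the identical construction to a fresh 48-bit random number instead of the real MAC:

```python
import ipaddress

def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    """Modified EUI-64 (RFC 4291 App. A): split the 48-bit MAC in half,
    insert ff:fe between the halves, flip the universal/local bit,
    and append the 64-bit result to the advertised prefix."""
    b = bytes(int(x, 16) for x in mac.split(":"))
    iid = bytes([b[0] ^ 0x02]) + b[1:3] + b"\xff\xfe" + b[3:]
    net = ipaddress.IPv6Network(prefix)
    return net[int.from_bytes(iid, "big")]

print(slaac_address("2001:db8::/64", "01:23:45:67:89:ab"))
# 2001:db8::323:45ff:fe67:89ab

# With ~8 live addresses per host and hence ~9 multicast groups,
# 3000 TCAM entries accommodate about 3000 // 9 == 333 hosts.
```

Because the interface identifier occupies a full 64 bits, there is nothing stopping a host from accumulating as many of these as it likes under one prefix, which is exactly the problem.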
This is generally not an issue for portable machines like laptops, because they forget all their old random addresses whenever their IPv6 prefix changes (which happens whenever they are disconnected from one IPv6-capable network and connected to another) but it is a very serious issue for workstations and, of course, servers, that are connected full-time to a single network. (The random addresses are also very problematic for me as a network administrator, because they mean that I am unable to trace back problem machines to their owners if they have been removed even briefly from the network in the interim.)
I used Ubuntu as an example, but it is hardly the worst offender. We have seen Windows machines with more than 300 IPv6 addresses, which, recall, means that every 150 seconds they emit more than 300 MLD listener reports, 30 multicast packets per second over the 10-second response window, every one of which has to be flooded through the network. That problem was caused by a broken Intel NIC driver: Windows attempts to offload IPv6 processing to the NIC while the workstation is in standby mode, to support wake-on-LAN over IPv6, and we had to get an updated driver from Intel to fix the problem (Windows Update was still distributing the broken driver). We’ve seen other machines that merely flood the network with copies of a single MLD listener report, sometimes hundreds of packets in less than a second. I first learned about this back in July, when we started to observe CPU overload on the switches that serve our area; it turned out to be one of the public Windows machines outside the TIG offices, but once we knew what to look for, we saw many other machines doing the same thing. (Pretty much any new Dell workstation running Windows has this bug unless the Intel Ethernet driver has been updated.)
For this reason, we will not be making IPv6 more broadly available again until we have a reliable way of ensuring that “privacy” addresses are disabled on all clients. (They are already disabled in CSAIL Ubuntu, but there are of course many other clients that we can’t control centrally.) We will probably move towards not supporting IPv6 “stateless” autoconfiguration at all, and rely on DHCPv6 to assign stable, traceable addresses to all IPv6 clients, but we’re a ways away from being able to implement that at this time.
I finally got home last night around 1 AM, after being awake for nearly 40 hours. If I never see another dawn in the office again, it will be too soon.
UPDATE (2014-09-06): As Stéphane Bortzmeyer was the first to point out, RFC 7217 addresses all of my issues with “privacy” addresses. Let implementation come soon!
CORRECTION (2014-09-06): There are actually only 16,777,216 “solicited node” multicast groups, not 4 billion (as I posted here). I originally said 65,536 in my email, but realized that was wrong, and miscorrected it when I posted here.