The network nightmare that ate my week

A few hours ago, I sent the following “dear colleagues” email (lightly edited to remove some private details) to all my users at work:

This has been a very trying week. For those of you whose work was disrupted this week by unplanned network outages, my deepest apologies. I am writing to you to explain what we know so far about the cause of these problems, what we have done to resolve them, and what actions still remain to be taken.

Over the course of the summer, we have been suffering from a variety of difficult-to-identify network problems. Our network switches have been observed using far more CPU than has historically been the case, we have had a variety of packet storms that appear to have been caused by forwarding loops despite the fact that we run a protocol designed to prevent such loops from taking place, and we have had a variety of unexplained switch crashes.

First issue

Starting very early Wednesday morning, the switch that serves the server room 32-399 began to crash unexpectedly, and in a way that (contrary to design) requires a human to physically power-cycle it in order to restore service. This switch crash affected a variety of research groups’ systems, as well as central AFS servers and the newest set of OpenStack hypervisors. We initially thought that this was a consequence of the power failure we experienced on Tuesday evening, and the vendor (Juniper Networks) suggested a reinstallation procedure for the member nodes in this switch to recover from possible corruption of the flash media from which the switch member nodes boot. Unfortunately, this did not resolve the problem, although it did put off the recurrence for a few hours.

On Thursday morning, an engineer from Juniper came to Stata to help us determine the cause of that crash and come up with a resolution. He was assisted by five other Juniper engineers from their Advanced Technical Assistance Center via teleconference. It took the rest of the day Thursday to come up with an action plan to resolve the issue (although still without any identifiable root cause), because the failing switch nodes were unresponsive on their console as well as the network, and none of the more obvious fixes we tried had any effect. (The switch logs, on reboot, stopped just before the crash.) Eventually we decided to upgrade this one switch to the latest version of Juniper’s firmware for this platform (EX4200), and simultaneously to make several configuration changes to reduce the use of code paths which are less well tested. This, in combination with what I’m about to explain next, appears to have resolved the issues with this particular switch. I will be monitoring this switch over the weekend to make sure it remains stable. Juniper has dispatched replacement hardware for all three member nodes of this switch, in case it proves necessary, but at this point we believe that the problem was caused by software and not a hardware failure (and thus, the association with the power outage was a red herring).

Second issue

Over the summer we have been experiencing a variety of issues with our core switches. We continued to experience these issues even after upgrading the firmware as recommended by Juniper, which we did (on an emergency, unscheduled basis) two Mondays ago. The most serious issue is that there appear to be bridge loops occasionally being created which saturate the CPU on the core switches, thereby preventing them from running necessary protocol processing. I believe (but don’t have official confirmation yet) that this can result in a priority inversion relative to flooding broadcast and multicast traffic through the network, such that the process that implements the Spanning-Tree Protocol (which constructs a loop-free topology by deactivating redundant links) is unable to run, which causes all of the access switches to think that their uplink is no longer connected to a spanning-tree bridge, which causes Sorcerer’s Apprentice packet amplification as multicasts are then forwarded down both core links simultaneously — which only adds to the overload on the core switches. Sometimes the core switches would recover from this condition without human intervention, but several times they did not, and I was forced to physically power-cycle one of them to break the loop.

We are trying to develop some configuration changes that will make this less likely in the future, by changing some of the assumptions in the Spanning-Tree Protocol to limit the possibility of bridge loops forming in the first place. This work has yet to be done, so in the meantime, we have made some changes to our network to reduce the CPU load on the core switches and make it less likely that the spanning-tree process can get starved in the first place.

The principal change that I have made in this regard is to disable IPv6 on most CSAIL networks. I have come to the conclusion that so much in IPv6 design and implementation has been botched by protocol designers and vendors (both ours and others) that it is simply unsafe to run IPv6 on a production network except in very limited geographical circumstances and with very tight central administration of hosts.

Technical details

The fundamental design problem with IPv6 is related to how it functions over shared layer-2 networks like Ethernet. In IPv4, there is a separate protocol (ARP) which is used by hosts to find the MAC address of other stations on the network. To do this, a host that wants to send a packet sends a layer-2 broadcast frame asking “who has this IP address?” The intended recipient, if it’s on the network, sends a reply that says “Hey, that’s me, and my MAC address (in case you missed it) is 01:23:45:67:89:ab!” If the intended recipient isn’t on the network, the sender keeps on broadcasting periodically until it either gets a response or gives up.
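For concreteness, here is a rough sketch in Python of the “who-has” broadcast frame an IPv4 host emits; this is an illustrative helper of my own, not anything our switches run:

```python
import struct

def arp_request(sender_mac: str, sender_ip: str, target_ip: str) -> bytes:
    """Build an Ethernet ARP who-has frame, broadcast to all stations."""
    mac = bytes.fromhex(sender_mac.replace(":", ""))
    eth = struct.pack("!6s6sH",
                      b"\xff" * 6,   # destination: Ethernet broadcast
                      mac,
                      0x0806)        # EtherType: ARP
    arp = struct.pack("!HHBBH6s4s6s4s",
                      1,             # hardware type: Ethernet
                      0x0800,        # protocol type: IPv4
                      6, 4,          # hardware/protocol address lengths
                      1,             # opcode 1 = request ("who-has")
                      mac,
                      bytes(map(int, sender_ip.split("."))),
                      b"\x00" * 6,   # target MAC: unknown, that's the point
                      bytes(map(int, target_ip.split("."))))
    return eth + arp
```

The key property is in that destination field: every ARP request goes to ff:ff:ff:ff:ff:ff, so switches flood it everywhere without having to track any state.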

In IPv6, this is not a separate protocol; it’s called “neighbor discovery”, and uses ICMPv6 packets. Because IPv6 does not have broadcasts, the solicitation packets must be sent to a multicast address. But rather than use the standard “all hosts” multicast address, the Neighbor Discovery Protocol instead specifies that every host must join another multicast group, one of 4 billion distinct “solicited node” multicast groups chosen as a function of the system’s IPv6 address, and Neighbor Discovery packets are sent to this group rather than being broadcast. (This means that the vast majority of all IPv6 multicast groups in use anywhere in the universe have a single member.)
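The derivation is mechanical: the group is ff02::1:ff00:0/104 with the low 24 bits of the unicast address appended (RFC 4291). A minimal sketch, with a function name of my own choosing:

```python
import ipaddress

def solicited_node_group(addr: str) -> str:
    """Solicited-node multicast group for an IPv6 unicast address:
    ff02::1:ff00:0/104 plus the low 24 bits of the address."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(base | low24))
```

Because only the low 24 bits of the address feed into the group, there are 2^24 possible groups (see the correction at the end of this post), and two hosts share a group only when those bits collide.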

In theory, that should be no worse than sending an ARP broadcast, but in practice it is much worse, because IPv6 systems must also implement the Multicast Listener Discovery protocol, by which all stations on the network report, when requested, all of the multicast groups they are members of — and they send these reports to the group in question, which means flooding those reports throughout the network, because the network switches have no way of knowing which Ethernet multicast addresses are desired on which ports. Furthermore, MLD packets are required by the protocol to be transmitted with a “router alert” option, which causes routers to bounce these packets from hardware forwarding into software, meaning that while the flooding of ARP broadcasts can be fast-pathed and is usually implemented in silicon, MLD listener report multicasts must be slow-pathed — and since IPv6 is still not widely deployed on very large layer-2 networks like a campus network or our building, MLD processing is generally poorly tested and not optimized by network vendors. Our core switches send MLD all-groups queries every 150 seconds, and direct end hosts to splay their responses over a 10-second interval, both as recommended in the MLD RFC.
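To put rough numbers on that, here is a back-of-the-envelope calculation using the intervals above; this is my own worst-case arithmetic, not a measurement:

```python
def mld_report_load(hosts: int, groups_per_host: int,
                    query_interval: float = 150.0, splay: float = 10.0) -> dict:
    """Worst-case MLD listener-report load on a flooded layer-2 segment:
    one report per group per host, in response to each all-groups query."""
    reports = hosts * groups_per_host
    return {
        "reports_per_query": reports,
        "peak_pps": reports / splay,           # during the response splay window
        "avg_pps": reports / query_interval,
    }
```

A single misbehaving host claiming 300 addresses peaks at 30 packets per second during the splay window, and every one of those reports is flooded through the network and slow-pathed by the routers.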

In theory, we could implement “MLD snooping” across our network to reduce the overhead of flooding MLD listener report packets all over. However, this is very new and raw code, at least in Juniper’s operating system, and not well exercised anywhere that I’m aware of. Even if we did this (and we tried), that would require at least two entries in the multicast zone of every switch’s TCAM (hardware forwarding table) for every host on every IPv6-enabled network in CSAIL, just to handle the Neighbor Discovery multicast groups — and in our switches, the entire TCAM space dedicated to IPv6 multicast is only 3,000 entries. That would be just barely enough to support all of CSAIL, but for one major issue: “privacy” addresses.

IPv6 “privacy” addresses are an incredible botch added to IPv6 a few years ago to simulate the “good old days” of dial-up Internet where machines changed their IP (v4) addresses all the time. In normal IPv6 usage, every host generates an IPv6 address by listening for router advertisements telling them what network prefix (the first 64 bits of the IPv6 address) to use, and appending to that a fixed function of their network interface card’s MAC address. In “privacy” addresses, the host simply generates a 48-bit pseudorandom number, pretends that it’s a MAC address, and applies the same function. The host will also maintain a “traditional” IPv6 address, used only for incoming connections — the random address is used for outgoing packets. What’s worse, the random address is changed regularly, typically daily, but the old random addresses are kept around for a fairly long time, on the order of a week, in case some other host out there on the network wants to “call back” after a new address has been generated. Thus, a typical machine — say, an Ubuntu 14.04 workstation with the default configuration — will end up claiming eight IPv6 addresses at any one time. That means nine IPv6 multicast groups, which means that those 3,000 TCAM entries can be exhausted by as few as 333 Ubuntu workstations.
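The two schemes can be sketched as follows. The “fixed function” is the modified EUI-64 mapping from RFC 4291, and the privacy variant below follows the description above at face value; function names are mine:

```python
import os

def modified_eui64_iid(mac_hex: str) -> str:
    """Modified EUI-64 interface identifier: insert ff:fe in the middle
    of the 48-bit MAC and flip the universal/local bit (0x02)."""
    b = bytearray(bytes.fromhex(mac_hex.replace(":", "")))
    b[0] ^= 0x02
    return (bytes(b[:3]) + b"\xff\xfe" + bytes(b[3:])).hex()

def privacy_iid() -> str:
    """Privacy-style identifier: make up a random 48-bit 'MAC' and
    run it through the same mapping."""
    fake = bytearray(os.urandom(6))
    fake[0] &= 0xFE                      # keep the multicast bit clear
    return modified_eui64_iid(fake.hex())

# TCAM arithmetic from the paragraph above: nine multicast groups per
# host against 3,000 IPv6 multicast entries in the switch.
print(3000 // 9)   # 333 workstations
```

Note that nothing here requires the host to limit how many of these identifiers it holds at once, which is exactly how the address-count explosion happens.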

This is generally not an issue for portable machines like laptops, because they forget all their old random addresses whenever their IPv6 prefix changes (which happens whenever they are disconnected from one IPv6-capable network and connected to another) but it is a very serious issue for workstations and, of course, servers, that are connected full-time to a single network. (The random addresses are also very problematic for me as a network administrator, because they mean that I am unable to trace back problem machines to their owners if they have been removed even briefly from the network in the interim.)

I used Ubuntu as an example, but it is hardly the worst offender. We have seen Windows machines with more than 300 IPv6 addresses — which, recall, means that every 150 seconds they will be transmitting 30 multicast packets per second which have to be flooded through the network. That problem was caused by a broken Intel NIC driver — Windows attempts to offload IPv6 processing to the NIC while the workstation is in standby mode, to support wake-on-LAN over IPv6 — but we had to get an updated driver from Intel to fix the problem (Windows Update was still distributing the broken driver). We’ve seen other machines that merely flood the network with copies of a single MLD listener report — sometimes hundreds of packets in less than a second. I first learned about this back in July when we started to observe CPU overload on the switches that serve our area; it turned out to be one of the public Windows machines outside the TIG offices, but once we knew what to look for, we saw many other machines doing the same thing. (Pretty much any new Dell workstation running Windows has this bug, unless the Intel Ethernet driver has been updated.)

For this reason, we will not be making IPv6 more broadly available again until we have a reliable way of ensuring that “privacy” addresses are disabled on all clients. (They are already disabled in CSAIL Ubuntu, but there are of course many other clients that we can’t control centrally.) We will probably move towards not supporting IPv6 “stateless” autoconfiguration at all, and rely on DHCPv6 to assign stable, traceable addresses to all IPv6 clients, but we’re a ways away from being able to implement that at this time.

I finally got home last night around 1 AM, after being awake for nearly 40 hours. If I never see another dawn in the office again, it will be too soon.

UPDATE (2014-09-06): As Stéphane Bortzmeyer was the first to point out, RFC 7217 addresses all of my issues with “privacy” addresses. Let implementation come soon!

CORRECTION (2014-09-06): There are actually only 16,777,216 “solicited node” multicast groups, not 4 billion (as I posted here). I originally said 65,536 in my email, but realized that was wrong, and miscorrected it when I posted here.


64 Responses to The network nightmare that ate my week

  1. Francisco Obispo says:

    Why do you have such a large broadcast (L2) domain? It seems like most of the problems that you’re experiencing could be reduced significantly if you had smaller domains with a routing hierarchy.

  2. Idontknowanything says:

    That sounds horrible … But it makes sense that’s why I keep ipv6 disabled on my own comp. disabling it will speed up your connection because you eliminate one of the biggest problems with wireless multicast … The constant battle of whether to use ipv6 or ipv4.

  3. Timh says:

    Stateless Auto-Configuration is the devil, I recommend everyone to kill it completely on all your networks. Use statically assigned IPv6 addresses the same way as you assign and configure IPv4 and if you must use some form of automatic protocol, go to DHCPv6. Whenever “automagic” is involved you’ll end up sad and tired. Just my 2c.

    • Frank says:

      Too bad that the MAC address isn’t visible in DHCPv6 packets, making it impossible as things stand to apply equivalent behavior to the v4 and v6 worlds. Everyone bug your router vendors to support RFC6939!

    • Phil Karn says:

      Stateless autoconfiguration is one of the most elegant and useful features of IPv6. It works fine for me, and it greatly reduces my workload as network administrator by not having to assign and maintain static addresses or install and maintain a DHCP server and associated kludges like DHCP relay.


  4. Wasn’t clear if you were soliciting for information on how to disable privacy extensions for Windows, but as a heads up (esp for others) you can do so via:

    netsh interface ipv6 set privacy state=disabled store=persistent (saved configuration)
    netsh interface ipv6 set privacy state=disabled store=active (running configuration)

    And confirm via:
    netsh interface ipv6 show privacy

    • Was well aware of that — but we don’t control most Windows machines on our network, nor can we be certain that they’re running non-buggy IP-offload drivers.

  5. JohnG says:

    IPv6 was intended to be used with large broadcast domains so you *should* be able to do this if it weren’t for the bugs/poor implementation. I have my own doubts about the design. We have IPv6 enabled (dual stack) across the organisation with a couple of our larger subnets having around 1000 hosts on each. We use DHCPv6 extensively (SLAAC disabled everywhere). You need to make sure your routers have auto-configure disabled in the Router Advertisements that they send out. If you don’t, clients get a DHCPv6 address AS WELL AS autoconfiguring a temporary and permanent address which would just add to your pain. If you want to control which devices are allowed on your network or what address you want them to have then collecting the DUIDs can be a major hurdle. Having to build a system in order to collect the DUID before you can connect the machine to the network is a nuisance. Also trying to find DHCP server software that supports allocation by DUID+IAID, allows the same combination on different subnets and logs details that can be used to debug problems is a challenge we have yet to overcome.

    • We only just upgraded to a version of JunOS that supports DHCPv6 relay, so we’re still a bit of a ways out on being able to do this, but yes, I think getting rid of SLAAC is going to be part of the way forward for us. But I also have to rewrite some of our backend systems (which are still IPv4 only) to be able to export the right bits to DHCPv6, and I’m hoping that I can configure my access switches in a way that disallows non-DHCP IPv6 clients on untrusted interfaces. Since we’re not lacking in IPv4 space, we’ll probably just use the IPv4 assignment as the IID for static IPv6 — that way, it’s one less number to manage.

      • Phil Karn says:

        It is of course your call whether to use DHCPv6 and/or SLAAC on your networks; both are defined to give you that choice. But remember that SLAAC was created specifically to simplify life for the network administrator by eliminating the requirement to always install and maintain a DHCP server and its databases on even the smallest network. I for one consider SLAAC one of the more elegant design features in IPv6.

        If I need to track down a misbehaving host, I can still do it with the L2 MAC address. But such unique identifiers simply do not belong in L3 addresses that go around the world and are mined on a vast scale by the NSA and who knows who else.

        Remember that one of the main reasons for IPv6 is to eliminate NATs; incredible botch though they may be, they do provide some inherent privacy by partly hiding the actual identity of a client host from the larger Internet. IPv6, by design, exposes a globally unique address for every client, so steps need to be taken to keep those unique addresses from easily identifying the client for any reason but delivering replies to its packets in real time.

  6. Tristan Rhodes says:

    I am not sure if this is related, but it made me think about it because we are currently dealing with this issue:

    We noticed high CPU on many of our network devices, and a packet-capture from any host showed tens of thousands of MLD packets. The fix is to upgrade the driver software on the Intel NICs, which has solved the problem for us.

    • I just re-read the post and see you already identified this issue. I wonder how many networks have this problem, but they lack the tools (monitoring CPU) or skillset to identify and correct it!

  7. Phil Karn says:

    IPv6 privacy addressing is a feature, not a bug. Snowden’s revelations should have made the utility of this feature obvious.

    IPv6 neighbor discovery with subnet multicast, not broadcasts, is a feature, not a bug. In fact, the level 2 people probably lobbied the strongest to avoid the use of subnet broadcasts as much as possible because of the inherent scaling problems. Make up your minds…!

    IPv6 is not responsible for poorly designed or implemented level 2 networks. Fix them.

    Note: I am the editor and primary author of RFC3819, Advice for Subnetwork Designers.

    • As I mention, the new RFC 7217 method for generating stable non-MAC-based addresses would resolve most if not all of my concerns. So would CGA, although that doesn’t seem to have been implemented by anyone, which is unfortunate.

      • Zamolxes says:

        Garrett, I really think you need to read more about IPv6. CGA is implemented by some but not to the extent it is deployable in a multivendor environment. And I am failing to understand how CGA, an antispoofing technique helps with your problems, not to mention that CGA requires more resources from the system when you already complained about NDP.

        IPv6 can be made a scapegoat only until people who understand it will call you on it. I would recommend learning more about it and then redesigning your network by taking full advantage of it.

      • Any system that results in end systems having a single, long-term stable, network address, whether it be 7217 or CGA or DHCPv6 or even traditional MAC-based SLAAC, works for me. NDP is not a resource drain in and of itself: it’s the multicast groups that are the problem, and that’s mainly because of MLD (because switches can obviously just flood unknown multicast groups just as they do broadcasts, but they have to actually process AND forward hundreds of MLD packets every 150 seconds).

        And no, I am not going to “redesign my network” to take advantage of IPv6. My users won’t tolerate that, and wouldn’t pay for that even if they thought it was a good idea (which all but one client is totally indifferent to). I’m pissed because I tried to do the right thing with the resources available to me, and the protocol designers and implementers let me down.

      • Phil Karn says:

        RFC7217 solves many of the privacy and security problems with using a MAC address as a stable IPv6 address. But RFC7217 itself points out that it is *not* a substitute for random, time-varying IPv6 “privacy” addresses. Only they can thwart correlation of a stationary host’s activities across time (see the last paragraph of section 8, Security Considerations).

        Your only truly valid complaint is that some IPv6 implementations (more specifically, NIC firmware and/or drivers) have bugs that caused them to maintain hundreds of temporary addresses per host. This is hardly a reason to condemn IPv6. If you want to jump on Microsoft or Intel, be my guest. Remember: random, time-varying IPv6 addresses are a feature, not a bug.

        What really baffles me is your statement that you’re happy with how IPv4 ARP works: flooding every ARP packet throughout the entire layer 2 network with MAC broadcasts, with no easy way to limit their scope. So why are you even bothering with multicast scoping if you don’t need to limit multicast traffic? Saving a few kilobits/sec or even megabits/sec on a campus LAN doesn’t seem like a wise trade if it invokes buggy slow paths in switch code. Go jump on Juniper too, be my guest.

        There’s nothing actually wrong with IPv6 here; we (the IETF) designed it this way for good reasons that are explained at length in the various standards RFCs. If you disagree I recommend joining the IETF, making constructive suggestions and writing RFCs of your own.

      • Of course, being able to correlate a host’s activities from day to day is actually an administrative requirement. When I get the first IPv6 DMCA takedown notice, I have to be able to find that user and let them know the copyright industry has noticed them, because I don’t want to lose upstream connectivity for my whole lab. Likewise when the first report of abuse of some publisher’s journal database.

        ARP implementations are, generally speaking, better tested and more robust than NDP implementations, and the fact that it uses broadcast actually helps to keep wayward vendors in line. Switch vendors expect that customers, if they use multicast at all, will be using it for the delivery of real-time business data to multiple interested recipients per group, not a few packets per hour to a single recipient, and they size their TCAMs accordingly. Even that wouldn’t be a problem if the solicited-node groups were exempted from MLD, so that switches had no need to keep track of them — because the actual NDP traffic is insignificant, it’s the MLD reports for the groups that are the problem.

        As for the IETF: I was there, sitting (at least virtually) next to Noel Chiappa as he and others, probably including you, argued about the locator/endpoint ID split. But I moved on to an operational role (rather than research) and wasn’t involved when NDP was designed (nor would I have known, at that time, how horrible the “solicited node multicast” hack would turn out to be — it would be substantially better if there were only 256 of them rather than 17 million). Presumably the people who actually were designing it didn’t expect this model of many IP addresses per end station to become a reality, or they might have made a different choice.

      • Phil Karn says:

        Once again Garrett you are blaming IPv6 for giving the level 2 people something they specifically asked for: a handle to segregate neighbor discovery traffic so it wouldn’t have to flood an entire subnetwork as IPv4 ARP does.

        Let’s go back to the basics: “End-to-End Arguments In System Design”, IMHO the single most important networking paper ever written. (It should be required reading for each new generation of network admins and designers). The basic principle, restated, is that many network functions can only be done properly at the endpoints so duplicating them within the network is either harmful or at best wasteful, *except for certain functions that can be justified as optional performance enhancements*.

        Multicast pruning within a bridged Ethernet is precisely one of those optional performance enhancements. It’s optional because the network will work just fine without it; just flood multicasts everywhere and the hosts (or their NICs) will simply ignore those they don’t want — something they have to do anyway. Pruning trades significantly increased switching complexity for decreased transmission loading, a performance enhancement. And as you have demonstrated so clearly this tradeoff is not always worthwhile, especially as many networks have grown in aggregate transmission speed much more than they’ve grown in switch horsepower. (It also demonstrates the wisdom of keeping networks as simpleminded as possible — again the end-to-end argument.)

        It is really the subnetwork designer/operator’s own fault that they implemented a multicast scheme that falls over under even light use when there are many ways they could make it far more robust. If multicast pruning capability is so expensive that it must be limited, then simply observe the amount of traffic to each multicast group and reserve the pruning capability to the busiest N groups. You yourself say that you’re happy with the way IPv4/ARP works, so obviously you don’t need the transmission optimization that pruning neighbor discovery traffic provides. So why are you using it?

        This is not the first time that widespread use of multicasting has revealed problems in subnetwork implementations. AT&T U-verse uses IP multicasting for high speed video streams, and many users have discovered that many WiFi base stations tend to crumble when exposed to such traffic even if it has no clients listening to it. (I can explain the precise details to anyone who wants, but they’re not really relevant here). Suffice it to say that it’s not fair to blame a new technology like IPv6 for taking vendors at their word when they advertise a feature like subnet multicasting that is poorly implemented and not ready for prime time.

      • Phil,

        I don’t have a choice about clients announcing their solicited-node multicast group memberships in MLD. That’s the way you protocol designers made it. And it worked fine so long as hosts only belonged to one such group. It doesn’t work when every host belongs to eight such groups, and it totally falls over when you have some hosts that belong to hundreds. These are not ginormous subnets here; the biggest is maybe 300 hosts or so, and they are all within the confines of a single building.

      • And in any case, I don’t care what “the layer 2 people” asked for. I am a small-time network operator, not a residential ISP, and definitely not a hardware vendor. What Juniper and Cisco may have told you back in 1997 is of no relevance to me, and will not give me my lost night’s sleep back.

      • Phil Karn says:

        Indeed you don’t control whether your users transmit MLDs. But you *do* have a choice as to whether you (i.e., the L2 switches) even pay attention to them. After all, they’re just one type of multicast packet, and pruning the distribution of *any* multicast packet is merely an optional performance enhancement to save transmission bandwidth. You’re not required to do it, and in that case you don’t even have to look at the MLDs.

        No, you might not care what the L2 people asked for a decade or two ago. I’m simply explaining the situation so you can direct your ire toward the right target. And it’s not IPv6.

      • You write as if the L2 and L3 layers were somehow separate. Nobody I know of builds a single-building, single-organization network like that. The same hardware, sharing the same CPUs, does both. And then you start getting into the TCAM exhaustion issue. Yes, the hardware vendor could have put bigger TCAMs in — but they rationally concluded that people didn’t need that many multicast groups. Then the OS vendors decided to start using temporary addresses promiscuously and destroyed that assumption.

      • Phil Karn says:

        Oh, I dunno, I built several networks like that, with fully distinct bridges and routers. It was important to maintain at least some neutrality toward the multiple protocols that had to co-exist at that time. Every so often I’d run tcpdump on one of my machines to look for anything odd, especially unexplained or unnecessary multicast/broadcast traffic, and try to do something about it before it actually became a problem.

        Of course it’s been a while. I understand vendors do often cut corners and operators and their managers are often too busy to pre-emptively find and fix problems so they often get surprised. But it’s not like IPv6 was invented yesterday, and vendors do respond to complaints — at least those who want to stay in business. So go jump on Juniper until they come up with a fix better than “stick your head in the sand re IPv6”, because it’s here to stay.

      • Phil Karn says:

        Oh, I forgot to address your concern about tracking down those who trigger DMCA notices.

        While this is probably a network administration headache that will resolve itself in time (what self-respecting “warez” and movie trader doesn’t encrypt and/or use TOR?) it’s also a red herring. You, the administrator, can still easily tell who used what address at what time: just collect MAC/IPv4/IPv6 associations. Even private IPv6 addresses can be tracked this way, for better or worse, by simply monitoring DAD assertions.

      • I don’t have the ability to do that now, whereas I do have the ability to collect (and delete) DHCP logs. It may only be a Small Matter of Programming, but it’s not high on my agenda at the moment.

      • Phil Karn says:

        I’m pretty sure there’s open source software already written that will do this monitoring; you don’t have to write anything.

    • Brian Candler says:

      “IPv6 privacy addressing is a feature, not a bug. Snowden’s revelations should have made the utility of this feature obvious.”

      Your IPv6 /64 is inherently just as trackable as your IPv4 /32. It says what network you’re plugged into, and therefore where you are. The globally unique token in the lower 64 bits was a problem introduced by SLAAC.

      ARP didn’t need replacing. DHCP didn’t need replacing. All we wanted was longer addresses.

      • Phil Karn says:

        No. The standard form of SLAAC uses your MAC address to produce the 64-bit host part of your address. Insert 0xfffe in the middle and flip the 0x02 bit in the first byte. MAC addresses are (supposed to be) stable and globally unique, so this makes it easy to track you as you move from one network to another. Randomly generating those 64 bits avoids that problem.

        Even knowing what network I’m on doesn’t uniquely identify me if I’m sharing it with a lot of other users.

        DHCP works as well as it ever did, but SLAAC gives you the opportunity to *eliminate* it. This can be a godsend for admins.

        Neighbor discovery in IPv6 is really the same thing as ARP. Only the name and the implementation details differ (it’s folded into ICMPv6 instead of being encoded as a parallel network-layer protocol), along with its use of Ethernet multicast rather than broadcast — a feature specifically requested by the vendors and operators of large bridged/switched Ethernets, btw.

      • Brian Carpenter says:

        Well, those of us who nearly lost their minds in the 1980s due to ARP-induced broadcast storms certainly *thought* ARP needed replacing. It’s sad that in a particular configuration the cure is worse than the disease, but the vast majority of IPv6 networks do not have this problem. And DHCPv6 is what it says on the can – DHCP for v6.
        Actually we did want more than longer addresses, and plug-and-play address configuration was on the list.

  8. Charles says:

    Well, a workaround will have to be found. Privacy IPv6 addresses exist for a reason, not because of dial-up nostalgia. A uniquely identifiable IP address across all networks (the MAC part) is a privacy nightmare, a cookie on steroids. I would not switch to IPv6 as a user if I couldn't have them.

    • Brian Carpenter says:

      Firstly a word of support for the originators of the current IPv6 privacy address mechanism: it was first documented in 2001 (RFC3041) when few people worried about privacy at all. But as already noted, if you want stable privacy addresses, RFC7217 will be the answer. However, in an enterprise or campus network, as the blog says, managers want to know which machine is which, so stateful DHCPv6 will prevail. That’s just repeating the history of IPv4.

    • Mike Johnson says:

      On a side note, useful for servers at least, newer Linux versions support tokenized IPv6 interface identifiers.

  9. Zamolxes says:

    I am curious how Juniper responds to this post, from the product team to the Advanced Technical Assistance Center. Clearly there are product issues here, but it seems to me there are major knowledge issues as well, both in the technology itself and in the current status of IPv6 deployment in large environments. I have designed and deployed very large IPv6 infrastructures without any problems of this sort, but this is a common pattern for IPv6: the new kid on the block gets blamed for everything.

  10. Bill Paul says:

    Not to pick nits, but the sample MAC address you used in the ARP discussion, 01:23:45:67:89:ab, is not valid as a station address. An Ethernet address with the least significant bit of the first octet set (01) is a multicast group address, not unicast, and station addresses must be unicast. I know you just made it up on the spot, but I've seen developers working on prototype network gear use sample MAC addresses like 01:02:03:04:05:06 in their test labs and then wonder why things don't work right.
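    The check being described is a one-liner; a small illustrative helper:

```python
def is_valid_station_mac(mac: str) -> bool:
    """A MAC is usable as a station (unicast) address only if the
    individual/group bit -- the least significant bit of the first
    octet -- is clear; 01:xx:... addresses name multicast groups."""
    first_octet = int(mac.split(":")[0], 16)
    return first_octet & 0x01 == 0
```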

  11. Pingback: Newsletter: September 7, 2014 | Notes from MWhite

  12. Mike O'Dell says:

    And now for another opinion.

    The notion that IPv6 was somehow *intended* to banish NATs is not factually correct; there was never any formal design requirement to make NATs illegal, impossible or unnecessary, even if that may have been the intention of a vocal subculture. The realization is that designing networks which can continue to scale *requires* the ability to create abstraction boundaries, *opaque* abstraction boundaries, which completely encapsulate structure, thereby limiting the necessity of propagating irrelevant (or private) details over arbitrarily large distances.

    Everywhere else in computer science, abstraction boundaries are revered for their power to provide separation of concerns and bound the propagation of complexity; their violation is considered prima facie evidence of bad design. Yet incomprehensibly, using abstraction boundaries as tools in designing scalable networks has never been considered an important issue in the IETF.

    NATs are *not* the Work of the Devil. NATs as known in IPv4 and IPv6 are the result of a desperate effort to create usable abstraction boundaries. The offensive difficulty of implementing usable, comprehensible abstraction boundaries in IPv4/6 is a direct demonstration of the architectural problems of IPv4 and IPv6.

    The ugliness of NATs is not the cause of the problems, it is the *result* of deeply-rooted problems relating to addresses and what they mean. IPv6 didn’t fix any of these problems – it actually made them much worse. Why? If nothing else, because having 2^96 more of something already not understood does not make matters better.

    • Phil Karn says:

      I beg to differ, Mike. The primary purpose of IPv6 was to create a new network layer with a much larger address space. A secondary purpose was to “clean up” IP, mainly by removing some features of IPv4 that turned out to be mistakes, such as router fragmentation. IPv4 NAT exists primarily because of the severely limited IPv4 address space; ergo, IPv6 eliminates the need for NATs (though not the ability to re-implement them if you’re misguided enough).

      Along with the enlargement of the address space, IPv6 allows precisely the kind of layered abstraction you desire in its management. By making a clear distinction between the network and host parts, the management of the Internet as a whole becomes decoupled (abstracted) from the management of a particular subnet. In particular, because the host part is so large you no longer have to request additional globally-routable prefixes just because you can’t address any more hosts within the ones you have (though you may want to do so for other reasons). Or use a NAT.

      Let’s go back to basics. What is the purpose of an address? Addresses play two similar but not identical roles. Servers use them as (fairly) well-defined network identifiers so clients can send them packets, and clients use them so servers know where to return their responses. In fact, the only reason clients even need addresses at all is that, for many good reasons hashed out decades ago, the Internet fabric is connectionless. If it had been built with virtual circuits throughout, clients would not need addresses at all, because all of the information needed to return a reply to one would be in the switches.

      And even in a connectionless network, there is no need for a client address to remain stable longer than the lifetime of its transactions with its servers.

      Note that I am not saying that clients should not identify themselves to their servers, only that this is an end-to-end function. There is no real need to identify a client to the entire network as well, and as Snowden has made abundantly clear, the fact that so many networks do has been exploited to the hilt by NSA metadata mining, creating an absolute disaster for personal privacy that is very hard to fix without fundamental changes to the architecture, such as randomly changing private addresses.

      • Brian Carpenter says:

        As a matter of fact, that’s why Jon Crowcroft’s idea of Sourceless Network Architecture is so cute. You don’t actually *need* the source address in the layer 3 header, even in a connectionless network. It can be conveyed encrypted for the destination to decrypt.
        But I fear we are drifting off topic.

    • Phil Karn says:

      Oh, I should add that another of the mistakes in IPv4 that IPv6 was designed to fix was the use of subnetwork broadcasting when subnetwork multicasting would work as well. ARP was *the* poster child. Neighbor discovery was designed as it was precisely to allow the subnetworks to abstract and generalize this function so that they could scale without being overwhelmed by the traffic.

      Ironic. Perhaps the level 2 people should have been more careful in what they wished for.

  13. My view of the world is that an L2 domain is a single point of failure. If you are trunking L2 to a single pair of core switches, you have one fault domain, and if anything goes wrong anywhere, all your eggs are in one basket. If one VLAN blows up, they all blow up, because you are L2-trunking everything everywhere. Many folks have moved to creating L3 boundaries at the edge: a single pair of switches in a closet shares an L2 broadcast domain and is therefore a single fault domain, and that fault domain is not shared outside of that pair of switches / floor / building / network block. Do more L3 and less L2.

  14. Brian Carpenter says:

    After studying a couple of RFCs, there’s something I don’t understand in the following text: “Furthermore, MLD packets are required by the protocol to be transmitted with a ‘router alert’ option, which causes routers to bounce these packets from hardware forwarding into software, meaning that while the flooding of ARP broadcasts can be fast-pathed and is usually implemented in silicon, MLD listener report multicasts must be slow-pathed.”
    As L2 multicasts, an L2 switch shouldn’t have any idea that what it’s replicating is an MLD packet, so why is it any different at L2 from flooding an ARP request? That should be quite independent of the L3 code also grabbing the packet, recognising that it’s an IPv6 packet (with a link-local source address, BTW) and then seeing the hop-by-hop header and the router alert option. By the time that happens, the L2 replication should be ancient history. Also, BTW, rate-limiting hop-by-hop options is strongly advised these days. It is fail-safe in this case since the scope of neighbor discovery is a single L2 domain, so these packets never need to be forwarded at L3.

  15. Erick Lobo says:

    Did any application or service depend on IPv6? Were any services interrupted or otherwise affected by disabling IPv6 on the network?

    • Nope. Nobody was using IPv6 for business functions, except incidentally as a result of getting AAAA answers from the DNS, so users were not impacted. It probably actually improved the browsing experience for some users, by getting them to a nearer CDN node.

      • Brian Carpenter says:

        Hmm. I think the major CDNs are pretty much dual stacked everywhere now. Sitting in NZ I get excellent IPv6 response times from Cloudflare in Sydney AU, for example.

      • Yes, but our IPv6 goes through a tunnel to New York City, which means that the “local” v4 CDN nodes are in general much closer, speed-of-light-wise, than the v6 CDN nodes are.

      • Phil Karn says:

        I’m actually not sure where you are, but I’d guess MIT, right? Something tells me the average MIT user probably doesn’t have a huge problem obtaining needed IPv4 addresses and/or running protocols that get stomped on by NATs beyond his control.

        The rest of the world outside MIT (yes, there is one) isn’t quite so lucky.

        A majority of my own traffic has been IPv6 for quite some time. Even via HE’s tunnels, which I still prefer despite both TWC and AT&T now providing “native” IPv6, I almost always get IPv6 service at least as good as IPv4 and often a little better.

      • Our v6 service is from Hurricane Electric, but of course for v4 we do have three /16s of our own, a “swamp” /24 (currently unrouted), and three /16s out of 18/8 so no, our users are not in any way hurting for address space. Occasionally I have to yell at them to garbage-collect their old addresses because that’s easier than widening a subnet mask; I’m not expecting that will ever be necessary in IPv6 land where all subnets are /64s….

      • Phil Karn says:

        Of course, I will admit that the reason I see such good service from HE’s IPv6 tunnels is that their tunnel endpoint in Los Angeles is very close to the CDNs and to where all the San Diego ISPs peer with them, each other and the outside world. There’s no, uh, escape from LA, so if it ever gets hit San Diego is toast even for most intra-city communications.

  16. Scott Hogg says:

    One thing you might want to consider is that one of the nodes on your network may be using The Hacker’s Choice IPv6 Attack Toolkit flood_router26 program, EvilFOCA, the SI6 toolkit, Scapy, or one of a variety of other tools to generate many rogue ICMPv6 type 134 Router Advertisement (RA) packets. This could explain why many of your Windows hosts have so many SLAAC addresses. Patched Windows 7 and Windows 8 systems will limit themselves to 100 privacy and 100 temporary addresses. There are some good papers on switch vendors’ sites describing how to limit these and implement first-hop security best practices. I have heard that there is a good book titled “IPv6 Security” that covers such topics.

  17. Rob Seastrom says:

    Garrett, I haven’t seen a diagram of your network but I am suspicious that the EX4200 in question was somewhere that it oughtn’t have been, architecturally speaking. I wouldn’t be deploying those in a position where they served more than a single rack worth of stuff, let alone sharing a layer 2 broadcast domain with things elsewhere in the network. I’ll echo the folks that say you should have smaller subnets – in fact, it’s IPv4 not IPv6 that’s keeping you from going to one subnet per office. Layer 3 scales… layer 2, well, how big you make it depends on how lucky you’re feeling (I bet you’re not feeling terribly lucky just now). In my world, STP ideally only gets used for one thing – triggering bpduguard on other switches when someone tries to get helpful and plug things in funny.

    • rs, I live in a very very different world. It’s all I can do to get my users to understand that they need to tell me when an office (or a network drop) changes hands so that I can update my configuration to reflect who and what is supposed to be there. One subnet per office, or even per rack, is total fantasy — and would be nearly impossible for me to manage in any case. We typically have a single switch (well, stack) per machine room, and another one per floor of offices, and I don’t think that’s remotely unusual. Of course, the users all have little desktop switches in their offices, so I need the spanning-tree to break the loops that they inevitably create. (Found one just today — “hey, why is that port blocking?” — as I was updating the configuration to set no-root-port everywhere, which I should have done long ago.)

    • Phil Karn says:

      You certainly could use very small IPv6 subnets, but why? If the problem is that the L2 switches can’t handle the necessary number of multicast groups for neighbor discovery, then don’t use them. Just let those multicasts flood the subnet. With nearly every host link running at 1 Gb/s or faster, and with nearly every host having smart NICs with hardware address filtering, is there really a need to prune this traffic that uses so little capacity anyway?

      • That would be fine if MLD snooping were not an all-or-nothing affair. Perhaps Juniper and other vendors will eventually remedy this by making MLD snooping configurable with an access list to determine which groups are snooped and which are not. However, if MLD snooping actually works, then it can be a valuable debugging tool for locating those problem hosts that are members of 300 groups.

      • Phil Karn says:

        Why not *disable* MLD snooping by default and enable it as needed? If your multicast traffic is just ARP and IPv6 ND, then it won’t really hurt anybody to flood it all. If somebody wants to run a high rate multicast stream you could configure MLD for that multicast group.

        Who cares if everybody on the subnet sees everybody’s IPv6 ND traffic? Everybody’s got a gigabit interface anyway, right? The host NICs will just ignore what they don’t want. If multicast traffic were to take up a significant fraction of a gigabit, only then would you have a problem.

        The only caveat is to keep your WiFi base stations off any subnet carrying a lot of multicast traffic unless you’re sure they do MLD correctly. Otherwise they’ll transmit over the radio every multicast they see on the Ethernet side, and they’ll do it at a low fixed data rate that can easily saturate the channel. (This is the AT&T Uverse problem. It carries HDTV as multicast IPv4 streams at about 6 Mb/s or so.)

        If the associative memories used to hold multicast address tables in switches are going to remain small for the foreseeable future (I don’t know if they are or not), then the software really ought to be smart enough to use those limited slots for the most active groups and let the less active addresses (which would include IPv6 ND) flood everywhere. Caching is a fairly established technique in computer science, no?

      • I’m sure my users will tell me when they start sending high-rate multicast streams — NOT!

      • Phil Karn says:

        So then you build a monitor to let you know automatically! Easy enough to do, it can run on any computer attached to the network. You can do the same thing to collect MAC/IP address pairs if you really need to. I’m pretty sure software already exists for this purpose.

      • Yeah, sure, in my Copious Free Time I’ll arrange for one machine to monitor all of my vlans and figure out if there is “enough” IPv6 multicast traffic for me to take countermeasures. First I’ll have to figure out how much that is, given link constraints that range from 10 Mbit/s shared to 10 Gbit/s point-to-point links….

      • Phil Karn says:

        Deciding whether your multicast traffic is enough to warrant action is really very easy: you monitor its rate as a fraction of the speed of the slowest link in the bridged network.

        You have bridged networks that include both 10 Gb/s and 10 Mb/s in the same network, i.e., broadcast domain, large enough to have enough multicast traffic to cause problems on the 10 Mb/s links? That’s just poor network engineering. Break them up into smaller broadcast domains, and group those users (if any) who actually make heavy use of application multicasting.

        I said “if any” because multicasting is still largely stillborn except for a few network-level activities like neighbor discovery (including ARP) and resource discovery (e.g., Bonjour/Zeroconf and UPnP), all of which are limited to the local broadcast subnet. Hardly anybody but a few research projects use it at the application level, which is actually a shame. The judicious use of multicasting could save a lot of unicast traffic.
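        The “fraction of the slowest link” test is trivially expressible; a sketch, with an arbitrary 5% threshold chosen purely for illustration:

```python
def multicast_exceeds_budget(mcast_bps: float,
                             link_speeds_bps: list[float],
                             threshold: float = 0.05) -> bool:
    """Flag a bridged domain when observed multicast traffic exceeds
    a given fraction of the slowest link in the broadcast domain.
    The 5% default is an arbitrary illustrative budget, not a
    recommendation from any standard."""
    return mcast_bps / min(link_speeds_bps) > threshold
```

        Feed it the measured multicast rate and the set of link speeds present in the broadcast domain, and it tells you whether the slowest member is at risk of being swamped.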

      • I don’t have a lot of multicast traffic on the same network as the Lisp Machines — but that’s not a result of good engineering, simply good luck that there were enough robotics people to give them their own subnet. (Robotics people use multicast heavily, although not IPv6 multicast just yet.)


  19. Pingback: Lazy Reading for 2014/09/14 – DragonFly BSD Digest

Comments are closed.