All you folks who followed my blog for food porn are going to want to skip this one.
The past ten days have been pretty horrible for me at work, thanks to the combined effect of five different network issues. One particularly difficult one to track down has been affecting the very access switch that serves my own office, and that makes it particularly frustrating (especially when trying to fight other fires). The particular symptom that we observed was that traffic would slow down for a short period — about 15 seconds, just long enough to notice but not long enough to actually track down the source of the problem.
It was clear from looking at interface statistics that there was some sort of broadcast or multicast storm going on. Early on, one particular network drop looked suspicious: when the slowdown occurred, we could see that the switch port was receiving a high rate of traffic (hundreds or even thousands of packets per second) and that these were being replicated to all or nearly all the other ports on that switch. When other switches started to trigger the same sort of alerts in our monitoring system, I physically unplugged that drop and things appeared to get better — but I still had no idea why. And things only appeared to get better: there were still slowdowns; they just weren’t as severe (and thus as noticeable) as before.
The access layer in our network is composed of Juniper EX4200 switches, and thanks to Junos's FreeBSD heritage, they have much better observability than most other switches. In particular, you can run start shell from the Junos command line and get a standard Unix shell (well, csh actually, which while non-standard is "standard enough" for most usual administrative tasks). There are some limitations: Juniper's kernel will only execute signed binaries, for example, so you can't install your own software on the switches (although Juniper offers an SDK for some platforms). But Junos includes a number of standard FreeBSD utilities, including tar and (relevant for this discussion) top. So I was able to log in to the problem switch and monitor exactly what was going on CPU-wise. Here's what a normal EX4200 looks like:
last pid: 26547;  load averages: 0.08, 0.12, 0.14    up 110+05:09:57  22:16:09
48 processes:  1 running, 47 sleeping
CPU states:  5.1% user,  0.0% nice,  4.6% system,  0.2% interrupt, 90.0% idle
Mem: 390M Active, 67M Inact, 47M Wired, 190M Cache, 110M Buf, 286M Free
Swap:

  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
 1374 root        2  44  -52 72584K 19100K select 237.3H  3.66% sfid
 1376 root        1   8    0 86884K 33716K nanslp 129.9H  1.12% pfem
 1373 root        1   4    0 68052K 11804K kqread  53.9H  0.00% chassism
 1424 root        1   4    0 10204K  6084K kqread  30.9H  0.00% mcsnoopd
 1393 root        1   4    0 23896K 15808K kqread 560:19  0.00% eswd
 1402 root        1  96    0 28968K 14992K select 420:57  0.00% mib2d
 1422 root        1   4    0 17320K  9716K kqread 273:08  0.00% lldpd
 1375 root        1   4  -20 14240K  8588K kqread 240:56  0.00% vccpd
 1401 root        1  96    0 22184K 16388K select 215:18  0.00% snmpd
 1426 root        1  96    0 12356K  6940K select 163:18  0.00% license-ch
This is just a standard FreeBSD top command, from about the 7.0 era, so it accepts the S and H flags to show system processes and threads, respectively, but in this case that didn't add any useful information. When things were hosed, yes, there was a process (mcsnoopd, the multicast snooping daemon) that was taking up a lot of CPU, but it was only getting 30% of the processor — the other 60% was being absorbed by "system" time and not attributable to any particular process or thread.
Seeing this, but being frustratingly unable to truly identify the source of the problem, I opened a JTAC case. The technician gave me a script to run, which totally failed to identify the problem. (JTAC's debugging scripts are apparently spread around cargo-cult fashion, and the JTAC staff don't actually understand what they do or have any idea how to debug them — in this case, the script I was given would never have found this issue because it was looking in the wrong place in the output of top -b — but it was impossible to explain this to the tech over email. The script itself was clearly a horrible hacked-together farrago of awk (twice) and sed (twice) to do something which could have been easily done in a single line of awk by someone who was actually competent.)
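For what it's worth, the whole job could plausibly have been a single awk condition. A hypothetical sketch, with column positions assumed from the top listing above and an arbitrary 20% threshold:

```shell
# Hypothetical replacement for the JTAC awk/sed pile: print any process
# whose weighted CPU (WCPU, the 10th field of this top's process rows)
# is above 20%. Field positions are assumed from the EX4200 output above.
top -b | awk '$10 ~ /%$/ && $10+0 > 20 { print $1, $11, $10 }'
```

Any process row whose WCPU field crosses the threshold gets printed as PID, command, and WCPU.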
Eventually we had a phone call and for once the Juniper "secure meeting" functionality worked. (I keep an old Thinkpad running Windows in my office for this purpose, since "secure meeting" doesn't usually work on any other platform.) I was able to convince the tech (after two days of waiting for this script to find something) that it wasn't going to work, and he moved on to another debugging step — one that should have occurred to me, but when you're under stress you often don't think of the obvious. One of the other FreeBSD tools that Junos includes is tcpdump (there's even an interface to it in the Junos command line), so you can actually take a packet capture directly on the switch, save it to the switch's internal flash, and then scp it somewhere else for analysis. The one drawback with this is that it can only see the packets that hit the switch's CPU — anything that's switched in hardware doesn't get seen by BPF — but that's actually an advantage when you're trying to identify something that's clobbering the switch's CPU.
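The workflow is roughly this (interface and file names are placeholders, and the exact option spellings vary across Junos releases):

```
monitor traffic interface ge-0/0/23 write-file /var/tmp/mld-capture.pcap
file copy /var/tmp/mld-capture.pcap admin@analysis-host:mld-capture.pcap
```

The first command is the Junos CLI front-end to tcpdump, writing a pcap file to internal flash; the second copies it off-box (over scp, under the hood) for analysis in Wireshark.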
We took a packet capture, and I uploaded it to JTAC so the tech could look at it while I was still on the phone, but there wasn’t anything that looked like it could possibly be the source of the problem. So we ended the call with the conclusion that I would continue to watch the switch and take a packet capture when the problem was actually happening. It didn’t take long — as it turned out, the problem condition was actually quite deterministic, and repeated every 125 seconds! So I had no trouble getting those captures, and after uploading them to JTAC I opened each one up in Wireshark and had a look for myself.
What immediately caught my attention was a long stream of IPv6 multicasts that overwhelmed all other traffic for about ten seconds — suspiciously similar to the 15-second slowdowns I had been watching in top. I saw that these were all IPv6 Multicast Listener Discovery packets, which are part of the ICMPv6 protocol and take the place of the IPv4 IGMP protocol — they are used by routers to identify which hosts are members of which multicast groups, so that only desired multicast traffic is forwarded down a network interface. I could see that the burst of multicasts was immediately preceded by an all-groups solicitation coming from one of our core switches, so that all seemed normal, and the solicitation itself actually said "please splay your responses over the next 10,000 milliseconds", which explained why the burst lasted for almost exactly that length of time. The multicast groups being reported, in nearly all cases, were the so-called "solicited node" multicast groups, which are part of IPv6's (mandatory) Neighbor Discovery Protocol — the way nodes on a network find out their neighbors' layer-2 addresses, analogous to ARP in IPv4. If this was just normal behavior, why exactly was it causing such problems? Surely Juniper had plenty of customers using IPv6 in production networks, so they must have tested this software. And why was the mcsnoopd process involved for IPv6 traffic? I started reading up on tuning parameters for MLD, reasoning that maybe the defaults were just wrong for a network of our size.
We had done a software update on all of our access switches during the spring. One of the new features added in Junos 12.1 (to which we had upgraded) was support for “MLD snooping”. This was the IPv6 analogue to “IGMP snooping”, which we did have configured. (In IGMP snooping, a layer-2 switch like ours listens to the layer-3 IGMP traffic, and even forges its own IGMP messages to make them look like they came from a router, to determine on a port-by-port basis which multicast groups are subscribed. Otherwise, it would have to flood all multicasts to all ports, which would take far more bandwidth and more interconnect resources on the switch, so even though it’s a “layering violation”, it’s a necessity in all but the smallest networks. MLD snooping does exactly the same thing, but for IPv6.) I thought that maybe not having MLD snooping configured was causing some sort of problem — perhaps the mcsnoopd process was taking an abnormal amount of CPU when faced with MLD packets that it didn’t know what to do with, or perhaps (since MLD packets themselves are also multicast) it was just doing the flooding in software rather than allowing the switch hardware to do it. In any event, I turned on MLD snooping for all VLANs on the switch, and the CPU spikes simply stopped. Coincidence? I’ll wait and see — but since more and more machines have IPv6 on by default, I’ll be deploying MLD snooping everywhere I can (regardless of whether it really helps my issue or not).
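In configuration terms the fix was tiny. On our EX switches it looked roughly like this (stanza name inferred from the parallel igmp-snooping form; verify against your Junos release):

```
set protocols mld-snooping vlan all
commit
```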
So I went home, and made dinner (well, actually, a bowl of macaroni and cheese). But I was still thinking about this issue at work. The JTAC technician sent me email, after I had left work, pointing to an issue on some HP machines running Windows with a bad NIC driver. We don't have many HP machines, so I initially dismissed the idea, but since I could run Wireshark just as easily at home as in the office, I scp'ed the trace files from my workstation and had a look. Opening up one of the traces in Wireshark, I used Statistics => Endpoint list => Ethernet to find which Ethernet addresses had sent the most traffic, and one jumped out at me immediately: a Dell machine on my network had sent the second-largest number of multicast and broadcast packets in this several-minute capture. It was second only to the core switch that was busily ARPing away for non-existent IPv4 addresses. (We have a lot of address space, and it's all on the public Internet, so we get scanned pretty much continually.) I logged in to the switch, and located the port where it connected, and lo and behold, it was on the same port as I had originally suspected!
I used Wireshark’s filtering capability to get an idea of what the machine was, and what exactly it was doing sending all of those packets. That told me the identity of the machine — it was a Windows machine, and Windows loves to broadcast its identity to all and sundry — and more to the point, I saw that it was responsible for the vast majority of those multicast memberships: 350 different multicast groups in all, nearly all of which were “solicited node” groups for different IPv6 “privacy” addresses.
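For anyone reproducing this, a Wireshark display filter along these lines isolates one station's MLD reports (the MAC address is a placeholder; ICMPv6 type 131 is an MLDv1 Multicast Listener Report, and type 143 is the MLDv2 equivalent):

```
eth.src == 00:24:e8:aa:bb:cc and (icmpv6.type == 131 or icmpv6.type == 143)
```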
Now for a digression on IPv6 addressing. An IPv6 address is 128 bits long. That is enough addresses to give billions of addresses to every cell in the body of every human on earth. Of course, there's no reason you'd do that, and a lot of the address space is "wasted" by giving it a structure that makes it practical to actually route IPv6 packets on a global scale. By convention, the top 64 bits of an IPv6 address identify a specific network, and the bottom 64 bits are an "interface identifier", which identifies the particular device attached to that network. Some bright light looked at this and said, oh, gee, all those users are used to dialup in the IPv4 Internet where they get a different address every time they turn their computer on, so we'd better find a way to emulate that in IPv6. Unfortunately, they did, and thus "privacy" addresses were born — where the interface ID is just randomly generated, and changes every so often. That way, nobody will ever have to worry about some bad guy on the IPv6 Internet identifying them by their MAC address. (Duh, they'll just use the "network" part of the IPv6 address instead!) Of course that will also ensure that nobody will ever have to care about making the IPv6 reverse DNS work properly, either, since nobody will use the same address from one day to the next.
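Mechanically, a "privacy" address is nothing more than the network's 64-bit prefix glued to 64 random bits. A rough sketch of the idea (the 2001:db8 prefix is the reserved documentation range; a real RFC 4941 implementation also manages specific IID bits and regeneration timers):

```shell
# Build an illustrative "privacy" address: 64-bit prefix + 64 random bits.
# 2001:db8:0:1::/64 is a documentation prefix, not a real network.
set -- $(od -An -N8 -tx1 /dev/urandom)
printf '2001:db8:0:1:%s%s:%s%s:%s%s:%s%s\n' "$1" "$2" "$3" "$4" "$5" "$6" "$7" "$8"
```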
Microsoft turned this “feature” on by default in recent versions of Windows. (Apple turned it on in Mac OS and iOS as well, but that’s a rant for another day.) Apparently the Windows implementation is more aggressive, or Windows applications are more likely to hold connections open, because there are lots of reports of Windows machines having hundreds of these “privacy” addresses simultaneously — and that’s what appears to be happening on our problem machine: every time the router sent out a “please tell me all your memberships in the next 10 seconds” message, the Windows box would reply with 30 MLD responses a second for ten seconds. That doesn’t seem like much, but now that I’ve started turning MLD snooping on, I can see that there are other VLANs in my network that look like they are having the same problem — so I have my work cut out for me. Luckily, there’s a really simple set of commands to completely disable “privacy” addresses on Windows machines (for whatever reason, Microsoft doesn’t provide a GUI for this), so now that I know what to look for, I can identify the problem machines and get their owners to fix them.
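For the record, the usual incantation is along these lines, from an elevated command prompt (verify against your Windows version; existing temporary addresses linger until they expire or the interface is reset):

```
netsh interface ipv6 set privacy state=disabled store=persistent
netsh interface ipv6 set global randomizeidentifiers=disabled store=persistent
```

The first disables RFC 4941 temporary addresses; the second stops Windows from randomizing the permanent interface ID as well.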